2025-05-27-12-16

Implementing Agents in JavaScript

Abstract

arXiv:2505.18228v1 Announce Type: new Abstract: This chapter gives an introduction to agent-oriented programming in JavaScript. It provides an example-based walk-through of how to implement abstractions for reasoning loop agents in vanilla JavaScript. The initial example is used as a stepping stone for explaining how to implement slightly more advanced agents and multi-agent systems using JS-son, a JavaScript library for agent-oriented programming. In this context, the chapter also explains how to integrate reasoning loop agents with generative AI technologies, specifically large language models. Finally, application scenarios in several technology ecosystems and future research directions are sketched.

摘要

本章介绍了JavaScript中的面向智能体编程方法。通过示例演示了如何在原生JavaScript中实现推理循环智能体的抽象概念。初始示例作为基础，进一步阐释了如何运用JS-son（一个面向智能体编程的JavaScript库）来实现更高级的智能体与多智能体系统。在此背景下，本章还探讨了如何将推理循环智能体与生成式人工智能技术——特别是大语言模型——进行整合。最后，概述了该技术在多个技术生态系统中的应用场景及未来研究方向。
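
The reasoning loop mentioned in the abstract above can be illustrated with a minimal sketch. It is written in Python rather than JavaScript, and it does not reproduce the JS-son API: the `Agent` and `Plan` classes, the thermostat scenario, and the belief-desire-plan split are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Beliefs = Dict[str, Any]


@dataclass
class Plan:
    """Hypothetical plan: the body runs when the head accepts the intentions."""
    head: Callable[[Dict[str, bool]], bool]
    body: Callable[[Beliefs], Dict[str, Any]]


@dataclass
class Agent:
    """Hypothetical reasoning-loop agent (not the JS-son Agent class)."""
    beliefs: Beliefs
    desires: Dict[str, Callable[[Beliefs], bool]]
    plans: List[Plan] = field(default_factory=list)

    def step(self, percepts: Beliefs) -> List[Dict[str, Any]]:
        # 1. Belief revision: new percepts simply overwrite old beliefs.
        self.beliefs = {**self.beliefs, **percepts}
        # 2. Deliberation: evaluate every desire against the current beliefs.
        intentions = {name: test(self.beliefs) for name, test in self.desires.items()}
        # 3. Means-end reasoning: run each plan whose head matches.
        return [plan.body(self.beliefs) for plan in self.plans if plan.head(intentions)]


thermostat = Agent(
    beliefs={"temperature": 21, "target": 21},
    desires={
        "heat": lambda b: b["temperature"] < b["target"] - 1,
        "cool": lambda b: b["temperature"] > b["target"] + 1,
    },
    plans=[
        Plan(lambda i: i["heat"], lambda b: {"action": "heater_on"}),
        Plan(lambda i: i["cool"], lambda b: {"action": "cooler_on"}),
    ],
)

print(thermostat.step({"temperature": 18}))
```

Each call to `step` is one pass through the loop: perceive, deliberate, act. A multi-agent system would run several such agents against a shared environment that collects their actions and hands back new percepts.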

An Outlook on the Opportunities and Challenges of Multi-Agent AI Systems

Abstract

arXiv:2505.18397v1 Announce Type: new Abstract: Multi-agent AI systems (MAS) offer a promising framework for distributed intelligence, enabling collaborative reasoning, planning, and decision-making across autonomous agents. This paper provides a systematic outlook on the current opportunities and challenges of MAS, drawing insights from recent advances in large language models (LLMs), federated optimization, and human-AI interaction. We formalize key concepts including agent topology, coordination protocols, and shared objectives, and identify major risks such as dependency, misalignment, and vulnerabilities arising from training data overlap. Through a biologically inspired simulation and comprehensive theoretical framing, we highlight critical pathways for developing robust, scalable, and secure MAS in real-world settings.

摘要

多智能体人工智能系统（MAS）为分布式智能提供了一个前景广阔的框架，能够实现自主智能体间的协同推理、规划与决策。本文基于大语言模型（LLMs）、联邦优化和人机交互领域的最新进展，系统性地阐述了当前MAS面临的机遇与挑战。我们形式化定义了智能体拓扑结构、协调协议和共享目标等关键概念，并识别出训练数据重叠导致的依赖性、目标偏差和系统脆弱性等主要风险。通过仿生学模拟实验和综合理论框架，我们重点探讨了在现实场景中开发鲁棒、可扩展且安全的MAS的关键路径。

Pedagogy-R1: Pedagogically-Aligned Reasoning Model with Balanced Educational Benchmark

Abstract

arXiv:2505.18467v1 Announce Type: new Abstract: Recent advances in large reasoning models (LRMs) show strong performance in structured domains such as mathematics and programming; however, they often lack pedagogical coherence and realistic teaching behaviors. To bridge this gap, we introduce Pedagogy-R1, a framework that adapts LRMs for classroom use through three innovations: (1) a distillation-based pipeline that filters and refines model outputs for instruction-tuning, (2) the Well-balanced Educational Benchmark (WBEB), which evaluates performance across subject knowledge, pedagogical knowledge, tracing, essay scoring, and teacher decision-making, and (3) a Chain-of-Pedagogy (CoP) prompting strategy for generating and eliciting teacher-style reasoning. Our mixed-method evaluation combines quantitative metrics with qualitative analysis, providing the first systematic assessment of LRMs' pedagogical strengths and limitations.

摘要

大规模推理模型（LRMs）近期在数学和编程等结构化领域展现出卓越性能，但其往往缺乏教学连贯性与真实教学行为。为弥合这一差距，我们提出Pedagogy-R1框架，通过三项创新实现LRMs的课堂适配：（1）基于蒸馏的流程，对模型输出进行教学调优的筛选与精炼；（2）均衡教育基准（WBEB），从学科知识、教学法知识、学习轨迹追踪、论文评分及教师决策五个维度评估性能；（3）教学链（CoP）提示策略，用于生成和引导教师风格推理。我们采用混合方法评估，结合量化指标与质性分析，首次系统评估了LRMs的教学优势与局限。

Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary

Abstract

arXiv:2505.18325v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet they often refuse to answer legitimate queries, a phenomenon known as overrefusal. Overrefusal typically stems from over-conservative safety alignment, causing models to treat many reasonable prompts as potentially risky. To systematically understand this issue, we probe and leverage the models' safety decision boundaries to analyze and mitigate overrefusal. Our findings reveal that overrefusal is closely tied to misalignment at these boundary regions, where models struggle to distinguish subtle differences between benign and harmful content. Building on these insights, we present RASS, an automated framework for prompt generation and selection that strategically targets overrefusal prompts near the safety boundary. By harnessing steering vectors in the representation space, RASS efficiently identifies and curates boundary-aligned prompts, enabling more effective and targeted mitigation of overrefusal. This approach not only provides a more precise and interpretable view of model safety decisions but also seamlessly extends to multilingual scenarios. We explore the safety decision boundaries of various LLMs and construct the MORBench evaluation set to facilitate robust assessment of model safety and helpfulness across multiple languages. Code and datasets will be released at https://anonymous.4open.science/r/RASS-80D3.

摘要

大语言模型（LLMs）在广泛的任务中展现出卓越能力，却经常拒绝回答合理查询——这种现象称为过度拒绝。过度拒绝通常源于过度保守的安全对齐机制，导致模型将许多合理提示视为潜在风险。为系统研究该问题，我们通过探测并利用模型的安全决策边界来分析及缓解过度拒绝。研究发现，过度拒绝与边界区域的错位密切相关，这些区域中模型难以区分良性内容与有害内容的细微差异。基于此，我们提出RASS框架——一种针对安全边界附近过度拒绝提示的自动化生成与选择策略。通过利用表征空间的导向向量，RASS高效识别并筛选边界对齐提示，实现更精准定向的过度拒绝缓解。该方法不仅为模型安全决策提供了更精确可解释的视角，还能无缝扩展至多语言场景。我们探索了多种LLMs的安全决策边界，并构建MORBench评估集以促进跨语言模型安全性与实用性的稳健评估。代码与数据集将在https://anonymous.4open.science/r/RASS-80D3发布。

Persona Alchemy: Designing, Evaluating, and Implementing Psychologically-Grounded LLM Agents for Diverse Stakeholder Representation

Abstract

arXiv:2505.18351v1 Announce Type: new Abstract: Despite advances in designing personas for Large Language Models (LLMs), challenges remain in aligning them with human cognitive processes and representing diverse stakeholder perspectives. We introduce a Social Cognitive Theory (SCT) agent design framework for designing, evaluating, and implementing psychologically grounded LLMs with consistent behavior. Our framework operationalizes SCT through four personal factors (cognitive, motivational, biological, and affective) for designing, six quantifiable constructs for evaluating, and a graph database-backed architecture for implementing stakeholder personas. Experiments tested agents' responses to contradicting information of varying reliability. In the highly polarized renewable energy transition discourse, we design five diverse agents with distinct ideologies, roles, and stakes to examine stakeholder representation. The evaluation of these agents in contradictory scenarios occurs through comprehensive processes that implement the SCT. Results show consistent response patterns ($R^2$ range: 0.58-0.61) and systematic temporal development of SCT construct effects. Principal component analysis identifies two dimensions explaining 73% of variance, validating the theoretical structure. Our framework offers improved explainability and reproducibility compared to black-box approaches. This work contributes to ongoing efforts to improve diverse stakeholder representation while maintaining psychological consistency in LLM personas.

摘要

尽管在设计大型语言模型（LLM）角色方面取得了进展，但在使其与人类认知过程保持一致及呈现多元利益相关者视角方面仍存在挑战。我们提出一种基于社会认知理论（SCT）的智能体设计框架，用于设计、评估和实现具有行为一致性的心理学基础LLM。该框架通过四大个人因素（认知、动机、生物和情感）进行设计，六个可量化构念进行评估，并采用图数据库支撑的架构来实现利益相关者角色建模。实验测试了智能体对不同可靠性矛盾信息的响应。在高度两极化的可再生能源转型讨论中，我们设计了五个具有不同意识形态、角色和利益诉求的多样化智能体，以检验利益相关者表征效果。通过实施SCT的综合流程，对这些智能体在矛盾情境中的表现进行评估。结果显示出一致的响应模式（R²范围：0.58-0.61）以及SCT构念效应的系统性时序发展。主成分分析识别出两个解释73%方差的维度，验证了理论结构。相较于黑箱方法，本框架提供了更好的可解释性和可复现性。这项工作为在保持LLM角色心理一致性的同时提升多元利益相关者表征能力的研究做出了贡献。

Single-agent or Multi-agent Systems? Why Not Both?

Abstract

arXiv:2505.18286v1 Announce Type: new Abstract: Multi-agent systems (MAS) decompose complex tasks and delegate subtasks to different large language model (LLM) agents and tools. Prior studies have reported the superior accuracy of MAS across diverse domains, enabled by long-horizon context tracking and error correction through role-specific agents. However, the design and deployment of MAS incur higher complexity and runtime cost compared to single-agent systems (SAS). Meanwhile, frontier LLMs, such as OpenAI-o3 and Gemini-2.5-Pro, have rapidly advanced in long-context reasoning, memory retention, and tool usage, mitigating many limitations that originally motivated MAS designs. In this paper, we conduct an extensive empirical study comparing MAS and SAS across various popular agentic applications. We find that the benefits of MAS over SAS diminish as LLM capabilities improve, and we propose efficient mechanisms to pinpoint the error-prone agent in MAS. Furthermore, the performance discrepancy between MAS and SAS motivates our design of a hybrid agentic paradigm, request cascading between MAS and SAS, to improve both efficiency and capability. Our design improves accuracy by 1.1-12% while reducing deployment costs by up to 20% across various agentic applications.

摘要

多智能体系统（MAS）通过将复杂任务分解并分配给不同的大语言模型（LLM）智能体与工具来实现任务处理。先前研究表明，得益于角色专属智能体的长程上下文追踪与错误修正能力，MAS在多个领域展现出卓越的准确性。然而相较于单智能体系统（SAS），MAS的设计与部署具有更高的复杂性和运行时成本。与此同时，前沿LLM（如OpenAI-o3和Gemini-2.5-Pro）在长上下文推理、记忆保持和工具使用方面快速进步，消解了许多最初促使MAS设计的局限性。本文通过大量实证研究对比了MAS与SAS在各类主流智能体应用中的表现，发现随着LLM能力的提升，MAS相对于SAS的优势逐渐减弱，并提出高效机制以定位MAS中易出错的智能体。此外，MAS与SAS的性能差异促使我们设计出一种混合智能体范式——在MAS与SAS之间进行请求级联，以同步提升效率与能力。该设计在各类智能体应用中实现1.1-12%的准确率提升，同时降低最高达20%的部署成本。
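
The request cascading idea in the abstract above can be sketched as follows. The abstract does not state how requests are routed, so the confidence threshold, the `(text, confidence)` return shape, and both stub agents are assumptions for illustration only.

```python
from typing import Callable, Tuple

# Hypothetical agent signature: request -> (answer text, confidence in [0, 1]).
AgentFn = Callable[[str], Tuple[str, float]]


def cascade(request: str, single_agent: AgentFn, multi_agent: AgentFn,
            threshold: float = 0.8) -> Tuple[str, str]:
    """Try the cheap single-agent system first; escalate only when it is unsure."""
    text, confidence = single_agent(request)
    if confidence >= threshold:
        return text, "SAS"
    # Assumed escalation rule: low confidence hands the request to the MAS.
    text, _ = multi_agent(request)
    return text, "MAS"


def single_agent(request: str) -> Tuple[str, float]:
    # Stub: pretends to be unsure about requests longer than eight words.
    confidence = 0.9 if len(request.split()) <= 8 else 0.4
    return f"SAS answer to: {request}", confidence


def multi_agent(request: str) -> Tuple[str, float]:
    # Stub: stands in for a slower, more expensive multi-agent pipeline.
    return f"MAS answer to: {request}", 1.0


print(cascade("What is 2 + 2?", single_agent, multi_agent))
```

Under this routing rule, the multi-agent cost is paid only for the requests that escalate, which is one way a hybrid design could cut deployment cost without giving up the MAS path for hard requests.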

RedactOR: An LLM-Powered Framework for Automatic Clinical Data De-Identification

Abstract

arXiv:2505.18380v1 Announce Type: new Abstract: Ensuring clinical data privacy while preserving utility is critical for AI-driven healthcare and data analytics. Existing de-identification (De-ID) methods, including rule-based techniques, deep learning models, and large language models (LLMs), often suffer from recall errors, limited generalization, and inefficiencies, limiting their real-world applicability. We propose a fully automated, multi-modal framework, RedactOR, for de-identifying structured and unstructured electronic health records, including clinical audio records. Our framework employs cost-efficient De-ID strategies, including intelligent routing, hybrid rule- and LLM-based approaches, and a two-step audio redaction approach. We present a retrieval-based entity relexicalization approach to ensure consistent substitutions of protected entities, thereby enhancing data coherence for downstream applications. We discuss key design desiderata, de-identification and relexicalization methodology, and modular architecture of RedactX and its integration with the Oracle Health Clinical AI system. Evaluated on the i2b2 2014 De-ID dataset using standard metrics with strict recall, our approach achieves competitive performance while optimizing token usage to reduce LLM costs. Finally, we discuss key lessons and insights from deployment in real-world AI-driven healthcare data pipelines.

摘要

确保临床数据隐私同时保持其实用性，对于AI驱动的医疗保健和数据分析至关重要。现有的去标识化（De-ID）方法，包括基于规则的技术、深度学习模型和大语言模型（LLMs），常存在召回错误、泛化能力有限和效率低下等问题，限制了其实际应用。我们提出了一种全自动多模态框架RedactOR，用于对结构化和非结构化电子健康记录（包括临床音频记录）进行去标识化处理。该框架采用高性价比的去标识化策略，包括智能路由、基于混合规则与LLM的方法，以及两步式音频脱敏方法。我们提出了一种基于检索的实体重词汇化方法，以确保对受保护实体进行一致替换，从而增强下游应用的数据连贯性。本文详细阐述了关键设计需求、去标识化与重词汇化方法、RedactX的模块化架构及其与Oracle Health临床AI系统的集成。在i2b2 2014 De-ID数据集上采用严格召回标准进行评估，我们的方法在优化令牌使用以降低LLM成本的同时，取得了具有竞争力的性能。最后，我们讨论了在实际AI驱动的医疗数据管道部署过程中获得的重要经验与见解。

A Survey of LLM $\times$ DATA

Abstract

arXiv:2505.18458v1 Announce Type: new Abstract: The integration of large language models (LLMs) and data management (DATA) is rapidly redefining both domains. In this survey, we comprehensively review the bidirectional relationships. On the one hand, DATA4LLM, spanning large-scale data processing, storage, and serving, feeds LLMs with the high-quality, diverse, and timely data required for stages like pre-training, post-training, retrieval-augmented generation, and agentic workflows: (i) Data processing for LLMs includes scalable acquisition, deduplication, filtering, selection, domain mixing, and synthetic augmentation; (ii) Data storage for LLMs focuses on efficient data and model formats, distributed and heterogeneous storage hierarchies, KV-cache management, and fault-tolerant checkpointing; (iii) Data serving for LLMs tackles challenges in RAG (e.g., knowledge post-processing), LLM inference (e.g., prompt compression, data provenance), and training strategies (e.g., data packing and shuffling). On the other hand, in LLM4DATA, LLMs are emerging as general-purpose engines for data management. We review recent advances in (i) data manipulation, including automatic data cleaning, integration, and discovery; (ii) data analysis, covering reasoning over structured, semi-structured, and unstructured data; and (iii) system optimization (e.g., configuration tuning, query rewriting, anomaly diagnosis), powered by LLM techniques like retrieval-augmented prompting, task-specialized fine-tuning, and multi-agent collaboration.

摘要

大语言模型（LLM）与数据管理（DATA）的融合正在迅速重塑这两个领域。本综述全面审视了双向关系。一方面，DATA4LLM涵盖大规模数据处理、存储与服务，为LLM的预训练、后训练、检索增强生成和智能体工作流等阶段提供高质量、多样化和时效性的数据支持：（i）面向LLM的数据处理包括可扩展采集、去重、过滤、选择、领域混合和合成增强；（ii）LLM数据存储聚焦高效数据与模型格式、分布式异构存储层次、KV缓存管理和容错检查点；（iii）LLM数据服务应对RAG（如知识后处理）、LLM推理（如提示压缩、数据溯源）和训练策略（如数据打包与混洗）等挑战。另一方面，在LLM4DATA中，LLM正成为数据管理的通用引擎。我们梳理了最新进展：（i）数据操作，包括自动化数据清洗、集成与发现；（ii）数据分析，涵盖结构化、半结构化和非结构化数据的推理；（iii）系统优化（如配置调优、查询重写、异常诊断），这些进步得益于检索增强提示、任务专用微调与多智能体协作等LLM技术。

LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs

Abstract

arXiv:2505.18517v1 Announce Type: new Abstract: Foundation models based on large language models (LLMs) have shown great success in handling various tasks and modalities. However, adapting these models for general-purpose audio-language tasks is challenging due to differences in acoustic environments and task variations. In this work, we introduce LiSTEN (Learning Soft Token Embeddings for Neural Audio LLMs), a framework for adapting LLMs to speech and audio tasks. LiSTEN uses a dynamic prompt selection strategy with learnable key-value pairs, allowing the model to balance general and task-specific knowledge while avoiding overfitting in a multitask setting. Our approach reduces dependence on large-scale ASR or captioning datasets, achieves competitive performance with fewer trainable parameters, and simplifies training by using a single-stage process. Additionally, LiSTEN enhances interpretability by analyzing the diversity and overlap of selected prompts across different tasks.

摘要

基于大型语言模型（LLM）的基础模型在处理多种任务和模态方面已展现出卓越成效。然而，由于声学环境的差异和任务多样性，将这些模型适配于通用音频-语言任务仍具挑战性。本研究提出LiSTEN（面向神经音频LLM的可学习软令牌嵌入框架），通过动态提示选择策略与可学习的键值对，使LLM能够适应语音和音频任务。该框架既能平衡通用知识与任务特定知识，又可避免多任务场景下的过拟合问题。我们的方法降低了对大规模自动语音识别或字幕数据集的依赖，以更少的可训练参数实现竞争性性能，并采用单阶段训练流程简化训练过程。此外，LiSTEN通过分析不同任务间所选提示的多样性与重叠性，增强了模型的可解释性。
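
Dynamic prompt selection over key-value pairs, as named in the abstract above, can be sketched generically. The abstract does not give the selection rule, so cosine-similarity top-k matching, the two-dimensional vectors, and the prompt names are assumptions; in a trained system the keys and the prompt embeddings would be learned parameters rather than fixed values.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_prompts(query: Sequence[float],
                   pool: List[Tuple[Sequence[float], str]],
                   k: int) -> List[str]:
    """Return the values whose keys are most similar to the query embedding."""
    ranked = sorted(pool, key=lambda pair: cosine(query, pair[0]), reverse=True)
    return [value for _, value in ranked[:k]]


# Hypothetical pool: each key vector is paired with a named soft prompt.
prompt_pool = [
    ([1.0, 0.0], "speech_prompt"),
    ([0.0, 1.0], "music_prompt"),
    ([0.7, 0.7], "general_prompt"),
]

print(select_prompts([1.0, 0.0], prompt_pool, k=2))
```

Counting how often each prompt is selected per task would give the kind of diversity and overlap analysis the abstract mentions.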

MRGAgents: A Multi-Agent Framework for Improved Medical Report Generation with Med-LVLMs

Abstract

arXiv:2505.18530v1 Announce Type: new Abstract: Medical Large Vision-Language Models (Med-LVLMs) have been widely adopted for medical report generation. Although Med-LVLMs achieve state-of-the-art performance, they exhibit a bias toward predicting all findings as normal, leading to reports that overlook critical abnormalities. Furthermore, these models often fail to provide comprehensive descriptions of radiologically relevant regions necessary for accurate diagnosis. To address these challenges, we propose Medical Report Generation Agents (MRGAgents), a novel multi-agent framework that fine-tunes specialized agents for different disease categories. By curating subsets of the IU X-ray and MIMIC-CXR datasets to train disease-specific agents, MRGAgents generates reports that more effectively balance normal and abnormal findings while ensuring a comprehensive description of clinically relevant regions. Our experiments demonstrate that MRGAgents outperforms the state-of-the-art, improving both report comprehensiveness and diagnostic utility.

摘要

医学大型视觉语言模型（Med-LVLMs）已被广泛应用于医学报告生成。尽管Med-LVLMs表现出最先进的性能，但它们存在将所有检查结果预测为正常的倾向，导致生成的报告忽略关键异常。此外，这些模型往往未能提供准确诊断所需的放射学相关区域的全面描述。为解决这些问题，我们提出医学报告生成代理（MRGAgents），这是一种新颖的多代理框架，通过微调针对不同疾病类别的专用代理。通过筛选IU X-ray和MIMIC-CXR数据集的子集以训练疾病特异性代理，MRGAgents生成的报告能更有效地平衡正常与异常结果，同时确保对临床相关区域的全面描述。实验表明，MRGAgents在报告全面性和诊断实用性方面均优于现有最先进方法。

Retrieval Augmented Decision-Making: A Requirements-Driven, Multi-Criteria Framework for Structured Decision Support

Abstract

arXiv:2505.18483v1 Announce Type: new Abstract: Various industries have produced a large number of documents such as industrial plans, technical guidelines, and regulations that are structurally complex and content-wise fragmented. This poses significant challenges for experts and decision-makers in terms of retrieval and understanding. Although existing LLM-based Retrieval-Augmented Generation methods can provide context-related suggestions, they lack quantitative weighting and traceable reasoning paths, making it difficult to offer multi-level and transparent decision support. To address this issue, this paper proposes the RAD method, which integrates Multi-Criteria Decision Making with the semantic understanding capabilities of LLMs. The method automatically extracts key criteria from industry documents, builds a weighted hierarchical decision model, and generates structured reports under model guidance. The RAD framework introduces explicit weight assignment and reasoning chains in decision generation to ensure accuracy, completeness, and traceability. Experiments show that in various decision-making tasks, the decision reports generated by RAD significantly outperform existing methods in terms of detail, rationality, and structure, demonstrating its application value and potential in complex decision support scenarios.

摘要

各行业产生了大量结构复杂、内容零散的工业规划、技术指南和法规文件，这给专家和决策者的检索与理解带来重大挑战。尽管现有基于大语言模型的检索增强生成方法能提供上下文相关建议，但缺乏定量权重和可追溯的推理路径，难以提供多层次、透明的决策支持。针对该问题，本文提出融合多准则决策与大语言模型语义理解能力的RAD方法，该方法能自动从行业文档中提取关键准则，构建加权层次化决策模型，并在模型指导下生成结构化报告。RAD框架在决策生成中引入显式权重分配和推理链，确保准确性、完整性和可追溯性。实验表明，在各类决策任务中，RAD生成的决策报告在细节性、合理性和结构性方面显著优于现有方法，展现了其在复杂决策支持场景中的应用价值与潜力。
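
The weighted hierarchical decision model described in the abstract above can be illustrated with a plain weighted-sum aggregation over a criteria tree. This is a generic multi-criteria sketch, not the RAD implementation: the `Criterion` class, the site-selection example, and all weights and scores are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Criterion:
    """Hypothetical node in a weighted criteria hierarchy."""
    name: str
    weight: float
    children: List["Criterion"] = field(default_factory=list)


def evaluate(node: Criterion, leaf_scores: Dict[str, float],
             trace: List[Tuple[str, float]]) -> float:
    """Aggregate scores bottom-up and record every step for traceability."""
    if node.children:
        total = sum(child.weight for child in node.children)
        # Sibling weights are normalised so they always sum to one.
        value = sum(child.weight / total * evaluate(child, leaf_scores, trace)
                    for child in node.children)
    else:
        value = leaf_scores[node.name]
    trace.append((node.name, value))
    return value


model = Criterion("site", 1.0, [
    Criterion("cost", 0.6),
    Criterion("risk", 0.4, [Criterion("flood", 0.5), Criterion("seismic", 0.5)]),
])
scores = {"cost": 0.5, "flood": 0.8, "seismic": 0.4}

trace: List[Tuple[str, float]] = []
overall = evaluate(model, scores, trace)
for name, value in trace:
    print(f"{name}: {value:.2f}")
```

The trace lists each criterion with the value it contributed, which is the kind of explicit weight assignment and reasoning chain that makes a generated recommendation auditable.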

RoleRAG: Enhancing LLM Role-Playing via Graph Guided Retrieval

Abstract

arXiv:2505.18541v1 Announce Type: new Abstract: Large Language Models (LLMs) have shown promise in character imitation, enabling immersive and engaging conversations. However, they often generate content that is irrelevant or inconsistent with a character's background. We attribute these failures to: (1) the inability to accurately recall character-specific knowledge due to entity ambiguity, and (2) a lack of awareness of the character's cognitive boundaries. To address these issues, we propose RoleRAG, a retrieval-based framework that integrates efficient entity disambiguation for knowledge indexing with a boundary-aware retriever for extracting contextually appropriate information from a structured knowledge graph. Experiments on role-playing benchmarks show that RoleRAG's calibrated retrieval helps both general-purpose and role-specific LLMs better align with character knowledge and reduce hallucinated responses.

摘要

大语言模型（LLMs）在角色模仿方面展现出潜力，能够实现沉浸式且引人入胜的对话。然而，其生成内容常出现与角色背景无关或不一致的问题。我们将这些缺陷归因于：（1）因实体歧义而无法准确回忆角色特定知识；（2）缺乏对角色认知边界的意识。为解决这些问题，我们提出RoleRAG框架，该检索增强方法结合了高效实体消歧的知识索引技术，以及边界感知检索器，用于从结构化知识图谱中提取符合上下文的信息。角色扮演基准测试表明，RoleRAG的校准检索机制能帮助通用型和角色专用型LLMs更好地对齐角色知识，并减少幻觉响应。

Seeing Beyond Words: MatVQA for Challenging Visual-Scientific Reasoning in Materials Science

Abstract

arXiv:2505.18319v1 Announce Type: new Abstract: The emergence of Multimodal Large Language Models (MLLMs) that integrate vision and language modalities has unlocked new potentials for scientific reasoning, outperforming prior benchmarks in both natural language and coding domains. Current materials science evaluation datasets such as MaScQA and SciQA remain largely text-based and fail to capture the visual and research-level analytic complexity required in materials discovery and design. We introduce MatVQA, a scalable benchmark specifically designed to address this gap. Generated via an automated pipeline, MArxivAgent, from recent materials literature, MatVQA features 1325 questions across four critical structure-property-performance (SPP) reasoning tasks. Uniquely, MatVQA employs an iterative process to eliminate textual shortcuts, compelling MLLMs to perform fine-grained, low-level visual analysis of material imagery (e.g., microscopy, diffraction patterns) integrated with multi-step scientific reasoning. Benchmarking 17 open- and closed-source MLLMs on MatVQA reveals substantial gaps in current multimodal reasoning capabilities. MatVQA benchmark data, along with evaluation code, is publicly available at https://anonymous.4open.science/r/matvqa-1E01 to catalyze further research in applying MLLMs to complex materials science problems.

摘要

多模态大语言模型（MLLMs）通过整合视觉与语言模态，在科学推理领域展现出新的潜力，其表现已超越自然语言和编程领域的既有基准。当前材料科学评估数据集（如MaScQA和SciQA）仍主要基于文本，未能涵盖材料发现与设计所需的视觉信息及研究级分析复杂度。为此，我们提出MatVQA——一个专为解决此问题设计的可扩展基准。该数据集通过自动化流程MArxivAgent从最新材料学文献生成，包含1325个问题，覆盖四种关键的结构-性能-功能（SPP）推理任务。MatVQA采用迭代流程消除文本捷径，迫使MLLMs对材料图像（如显微图像、衍射图谱）进行细粒度底层视觉分析，并与多步骤科学推理相结合。对17个开源与闭源MLLMs的基准测试揭示了当前多模态推理能力的显著不足。MatVQA基准数据及评估代码已公开于https://anonymous.4open.science/r/matvqa-1E01，以推动MLLMs在复杂材料科学问题中的应用研究。

Abstract

arXiv:2505.18279v1 Announce Type: new Abstract: Complex tasks are increasingly delegated to ensembles of specialized LLM-based agents that reason, communicate, and coordinate actions, both among themselves and through interactions with external tools, APIs, and databases. While persistent memory has been shown to enhance single-agent performance, most approaches assume a monolithic, single-user context, overlooking the benefits and challenges of knowledge transfer across users under dynamic, asymmetric permissions. We introduce Collaborative Memory, a framework for multi-user, multi-agent environments with asymmetric, time-evolving access controls encoded as bipartite graphs linking users, agents, and resources. Our system maintains two memory tiers: (1) private memory: private fragments visible only to their originating user; and (2) shared memory: selectively shared fragments. Each fragment carries immutable provenance attributes (contributing agents, accessed resources, and timestamps) to support retrospective permission checks. Granular read policies enforce current user-agent-resource constraints and project existing memory fragments into filtered transformed views. Write policies determine fragment retention and sharing, applying context-aware transformations to update the memory. Both policies may be designed conditioned on system, agent, and user-level information. Our framework enables safe, efficient, and interpretable cross-user knowledge sharing, with provable adherence to asymmetric, time-varying policies and full auditability of memory operations.

摘要

复杂任务正越来越多地委托给由专业化基于大语言模型的智能体组成的协作系统，这些智能体能够进行推理、通信和协调行动——既通过彼此间的交互，也通过与外部工具、API及数据库的互动。虽然持久性记忆已被证明能提升单智能体性能，但现有方法大多基于单一用户场景下的单体架构，忽视了动态非对称权限环境下跨用户知识迁移的效益与挑战。我们提出协作记忆框架，这是一种适用于多用户多智能体环境的解决方案，其通过二分图编码用户、智能体与资源之间非对称且随时间演化的访问控制关系。该系统维护双层记忆结构：(1)仅对创建者可见的私有记忆片段；(2)选择性共享的公共记忆片段。每个片段均携带不可篡改的溯源属性（贡献智能体、访问资源及时间戳）以支持追溯式权限校验。细粒度读取策略强制执行当前用户-智能体-资源约束，并将现有记忆片段投影为经过筛选的转换视图。写入策略通过上下文感知的转换操作决定片段的保留与共享方式。这两类策略均可基于系统、智能体及用户层级信息进行定制设计。本框架实现了安全高效且可解释的跨用户知识共享，可证明地遵循非对称时变策略，并确保所有记忆操作具备完全可审计性。
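
A read policy over provenance-tagged fragments, as outlined in the abstract above, might look like the sketch below. The abstract does not specify the data model or the exact rule, so the `Fragment` and `AccessGraphs` classes and the subset-based check are assumptions chosen to show how retrospective permission checks can work.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set


@dataclass(frozen=True)
class Fragment:
    """Hypothetical memory fragment; frozen so provenance cannot be altered."""
    text: str
    owner: str
    shared: bool
    agents: FrozenSet[str]     # agents that contributed to the fragment
    resources: FrozenSet[str]  # resources accessed while producing it
    timestamp: int


@dataclass
class AccessGraphs:
    """Current snapshot of the two bipartite graphs (they may change over time)."""
    user_agents: Dict[str, Set[str]]
    agent_resources: Dict[str, Set[str]]


def readable(fragment: Fragment, user: str, agent: str, graphs: AccessGraphs) -> bool:
    """Assumed rule: check a fragment's provenance against current permissions."""
    allowed_agents = graphs.user_agents.get(user, set())
    if agent not in allowed_agents:
        return False
    if not fragment.shared:
        return fragment.owner == user
    allowed_resources: Set[str] = set()
    for name in allowed_agents:
        allowed_resources |= graphs.agent_resources.get(name, set())
    return fragment.agents <= allowed_agents and fragment.resources <= allowed_resources


graphs = AccessGraphs(
    user_agents={"alice": {"planner", "hr_bot"}, "bob": {"planner"}},
    agent_resources={"planner": {"wiki"}, "hr_bot": {"hr_db"}},
)
note = Fragment("Q3 hiring plan", owner="alice", shared=True,
                agents=frozenset({"hr_bot"}), resources=frozenset({"hr_db"}),
                timestamp=1)

print(readable(note, "alice", "planner", graphs))
print(readable(note, "bob", "planner", graphs))
```

Because the check runs against the current graph snapshot rather than the permissions in force when the fragment was written, granting or revoking an edge changes visibility immediately without rewriting stored memory.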

Abstract

arXiv:2505.18531v1 Announce Type: new Abstract: Training multi-modal large language models (MLLMs) that align with human intentions is a long-term challenge. Traditional score-only reward models for alignment suffer from low accuracy, weak generalization, and poor interpretability, blocking the progress of alignment methods, e.g., reinforcement learning from human feedback (RLHF). Generative reward models (GRMs) leverage MLLMs' intrinsic reasoning capabilities to discriminate pair-wise responses, but their pair-wise paradigm makes it hard to generalize to learnable rewards. We introduce Generative RLHF-V, a novel alignment framework that integrates GRMs with multi-modal RLHF. We propose a two-stage pipeline: multi-modal generative reward modeling from RL, where RL guides GRMs to actively capture human intention and then predict the correct pair-wise scores; and RL optimization from grouped comparison, which enhances multi-modal RL scoring precision through grouped response comparison. Experimental results demonstrate that, besides out-of-distribution generalization of RM discrimination, our framework improves 4 MLLMs' performance across 7 benchmarks by 18.1%, while the baseline RLHF improves it by only 5.3%. We further validate that Generative RLHF-V achieves a near-linear improvement with an increasing number of candidate responses. Our code and models can be found at https://generative-rlhf-v.github.io.

摘要

训练符合人类意图的多模态大语言模型（MLLMs）是一项长期挑战。传统基于单一分数的奖励模型在对齐任务中存在准确率低、泛化能力弱和可解释性差等问题，阻碍了强化学习人类反馈（RLHF）等对齐方法的进展。生成式奖励模型（GRMs）利用MLLMs固有的推理能力判别成对响应，但其成对范式难以推广至可学习的奖励机制。本文提出Generative RLHF-V——一个将GRMs与多模态RLHF相结合的新型对齐框架，采用两阶段流程：基于强化学习的多模态生成式奖励建模，通过强化学习引导GRMs主动捕捉人类意图并预测成对分数；以及基于分组比较的强化学习优化，通过响应分组比较提升多模态强化学习的评分精度。实验结果表明，除奖励模型的分布外泛化能力外，本框架在7个基准测试中将4种MLLMs性能提升18.1%，而基线RLHF仅提升5.3%。进一步验证表明，随着候选响应数量增加，Generative RLHF-V可实现近线性改进。代码与模型详见https://generative-rlhf-v.github.io。

Knowledge Grafting of Large Language Models

Abstract

arXiv:2505.18502v1 Announce Type: new Abstract: Cross-capability transfer is a key challenge in large language model (LLM) research, with applications in multi-task integration, model compression, and continual learning. Recent works like FuseLLM and FuseChat have demonstrated the potential of transferring multiple model capabilities to lightweight models, enhancing adaptability and efficiency, which motivates our investigation into more efficient cross-capability transfer methods. However, existing approaches primarily focus on small, homogeneous models, limiting their applicability. For large, heterogeneous models, knowledge distillation with full-parameter fine-tuning often overlooks the student model's intrinsic capacity and risks catastrophic forgetting, while PEFT methods struggle to effectively absorb knowledge from source LLMs. To address these issues, we introduce GraftLLM, a novel method that stores source model capabilities in a target model with SkillPack format. This approach preserves general capabilities, reduces parameter conflicts, and supports forget-free continual learning and model fusion. We employ a module-aware adaptive compression strategy to compress parameter updates, ensuring efficient storage while maintaining task-specific knowledge. The resulting SkillPack serves as a compact and transferable knowledge carrier, ideal for heterogeneous model fusion and continual learning. Experiments across various scenarios demonstrate that GraftLLM outperforms existing techniques in knowledge transfer, knowledge fusion, and forget-free learning, providing a scalable and efficient solution for cross-capability transfer. The code is publicly available at: https://github.com/duguodong7/GraftLLM.

摘要

跨能力迁移是大型语言模型（LLM）研究中的关键挑战，其应用涵盖多任务集成、模型压缩和持续学习等领域。近期FuseLLM和FuseChat等研究表明，将多个模型能力迁移至轻量级模型可显著提升适应性与效率，这促使我们探索更高效的跨能力迁移方法。然而现有方法主要针对小型同构模型，限制了其适用性。对于大型异构模型，基于全参数微调的知识蒸馏往往忽视学生模型的固有容量并存在灾难性遗忘风险，而参数高效微调（PEFT）方法则难以有效吸收源LLM的知识。为此，我们提出GraftLLM——一种通过SkillPack格式将源模型能力存储至目标模型的新方法。该方法能保留通用能力、减少参数冲突，并支持无遗忘持续学习与模型融合。我们采用模块感知的自适应压缩策略对参数更新进行压缩，在保持任务特定知识的同时实现高效存储。生成的SkillPack可作为紧凑可迁移的知识载体，特别适用于异构模型融合与持续学习。多场景实验表明，GraftLLM在知识迁移、知识融合和无遗忘学习方面均优于现有技术，为跨能力迁移提供了可扩展的高效解决方案。代码已开源：https://github.com/duguodong7/GraftLLM。

PacTrain: Pruning and Adaptive Sparse Gradient Compression for Efficient Collective Communication in Distributed Deep Learning

Abstract

arXiv:2505.18563v1 Announce Type: new Abstract: Large-scale deep neural networks (DNNs) exhibit excellent performance for various tasks. As DNNs and datasets grow, distributed training becomes extremely time-consuming and demands larger clusters. A main bottleneck is the resulting gradient aggregation overhead. While gradient compression and sparse collective communication techniques are commonly employed to alleviate network load, many gradient compression schemes do not achieve acceleration of the training process while also preserving accuracy. This paper introduces PacTrain, a novel framework that accelerates distributed training by combining pruning with sparse gradient compression. Active pruning of the neural network makes the model weights and gradients sparse. By ensuring the global knowledge of the gradient sparsity among all distributed training workers, we can perform lightweight compression communication without harming accuracy. We show that the PacTrain compression scheme achieves a near-optimal compression strategy while remaining compatible with the all-reduce primitive. Experimental evaluations show that PacTrain improves training throughput by 1.25 to 8.72 times compared to state-of-the-art compression-enabled systems for representative vision and language model training tasks under bandwidth-constrained conditions.

摘要

大规模深度神经网络(DNN)在各种任务中展现出卓越性能。随着DNN和数据集规模的增长，分布式训练变得极其耗时且需要更大规模的集群。主要瓶颈在于梯度聚合带来的通信开销。虽然梯度压缩和稀疏集体通信技术常被用于缓解网络负载，但多数梯度压缩方案无法在保持精度的同时加速训练过程。本文提出PacTrain框架，通过将剪枝与稀疏梯度压缩相结合来加速分布式训练。神经网络的主动剪枝使得模型权重和梯度具有稀疏性。通过确保所有分布式训练节点共享梯度稀疏性的全局认知，我们可以在不影响精度的情况下实现轻量级压缩通信。研究表明，PacTrain压缩方案实现了接近最优的压缩策略，同时保持与全归约原语的兼容性。实验评估表明，在带宽受限条件下，针对典型视觉和语言模型训练任务，PacTrain相比最先进的压缩系统将训练吞吐量提升了1.25至8.72倍。
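
Why a globally known sparsity pattern keeps compression compatible with all-reduce can be shown with a small sketch. It is a simplified illustration under assumptions, not the PacTrain scheme itself: workers are simulated in one process, `all_reduce_sum` stands in for a real collective, and the mask and gradient values are invented.

```python
from typing import List, Sequence


def pack(gradient: Sequence[float], mask: Sequence[int]) -> List[float]:
    """Keep only the entries that survive pruning, in index order."""
    return [g for g, keep in zip(gradient, mask) if keep]


def unpack(packed: Sequence[float], mask: Sequence[int]) -> List[float]:
    """Scatter packed values back to full size, with zeros at pruned positions."""
    values = iter(packed)
    return [next(values) if keep else 0.0 for keep in mask]


def all_reduce_sum(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Stand-in for a real all-reduce: element-wise sum across workers."""
    return [sum(column) for column in zip(*vectors)]


# Every worker holds the same pruning mask, so packed vectors line up by index.
mask = [1, 0, 0, 1, 1, 0]
worker_gradients = [
    [0.5, 0.0, 0.0, -1.0, 2.0, 0.0],
    [1.5, 0.0, 0.0, 0.25, 1.0, 0.0],
]

packed = [pack(g, mask) for g in worker_gradients]  # 3 values each instead of 6
reduced = unpack(all_reduce_sum(packed), mask)
print(reduced)
```

Since all workers share one mask, the packed vectors have the same length and the same index meaning, so an ordinary dense sum works and no per-worker index lists need to be exchanged. Schemes where each worker picks its own top-k entries lose that property.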

Enumerate-Conjecture-Prove: Formally Solving Answer-Construction Problems in Math Competitions

Abstract

arXiv:2505.18492v1 Announce Type: new Abstract: Mathematical reasoning lies at the heart of artificial intelligence, underpinning applications in education, program verification, and research-level mathematical discovery. Mathematical competitions, in particular, present two challenging problem types: theorem-proving, requiring rigorous proofs of stated conclusions, and answer-construction, involving hypothesizing and formally verifying mathematical objects. Large Language Models (LLMs) effectively generate creative candidate answers but struggle with formal verification, while symbolic provers ensure rigor but cannot efficiently handle creative conjecture generation. We introduce the Enumerate-Conjecture-Prove (ECP) framework, a modular neuro-symbolic method integrating LLM-based enumeration and pattern-driven conjecturing with formal theorem proving. We present ConstructiveBench, a dataset of 3,431 answer-construction problems in various math competitions with verified Lean formalizations. On the ConstructiveBench dataset, ECP improves the accuracy of answer construction from the Chain-of-Thought (CoT) baseline of 14.54% to 45.06% with the gpt-4.1-mini model. Moreover, combining with ECP's constructed answers, the state-of-the-art DeepSeek-Prover-V2-7B model generates correct proofs for 858 of the 3,431 constructive problems in Lean, achieving 25.01% accuracy, compared to 9.86% for symbolic-only baselines. Our code and dataset are publicly available at GitHub and HuggingFace, respectively.

摘要

数学推理是人工智能的核心基础，支撑着教育、程序验证和研究级数学发现等应用领域。数学竞赛尤其呈现出两类具有挑战性的问题类型：定理证明（要求对既定结论进行严格证明）和答案构建（涉及数学对象的假设与形式化验证）。大语言模型（LLMs）能有效生成创造性候选答案，但在形式化验证方面存在不足；而符号证明器虽能确保严谨性，却无法高效处理创造性猜想生成。我们提出枚举-猜想-证明（ECP）框架，这是一种模块化神经符号方法，整合了基于LLM的枚举、模式驱动猜想与形式化定理证明。我们构建了ConstructiveBench数据集，包含3,431道各类数学竞赛中的答案构建问题，并配有经过验证的Lean形式化代码。在ConstructiveBench数据集上，ECP框架将答案构建的准确率从思维链（CoT）基线的14.54%提升至45.06%（使用gpt-4.1-mini模型）。此外，结合ECP构建的答案，最先进的DeepSeek-Prover-V2-7B模型为3,431道构造性问题中的858道生成了正确的Lean证明，准确率达25.01%，而纯符号基线的准确率仅为9.86%。我们的代码和数据集已分别在GitHub和HuggingFace平台公开。
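
The three stages named in the abstract above can be walked through on a toy answer-construction problem: find a closed form for the sum of the first n odd numbers. Everything here is an assumption for illustration. The candidate list replaces LLM-driven conjecturing, and the final stage is only a bounded numeric check standing in for the formal Lean proof the framework relies on.

```python
from typing import Callable, Dict, List, Tuple


def sum_of_first_odds(n: int) -> int:
    """Toy problem: the quantity whose closed form we want to construct."""
    return sum(2 * k - 1 for k in range(1, n + 1))


def enumerate_cases(problem: Callable[[int], int], n_max: int) -> List[Tuple[int, int]]:
    """Enumerate: compute the answer for small instances."""
    return [(n, problem(n)) for n in range(1, n_max + 1)]


# Hypothetical pool of closed forms; a real system would generate these.
CANDIDATES: Dict[str, Callable[[int], int]] = {
    "n": lambda n: n,
    "n^2": lambda n: n * n,
    "n(n+1)/2": lambda n: n * (n + 1) // 2,
    "2^n - 1": lambda n: 2 ** n - 1,
}


def conjecture(cases: List[Tuple[int, int]]) -> List[str]:
    """Conjecture: keep every candidate that matches all enumerated cases."""
    return [name for name, formula in CANDIDATES.items()
            if all(formula(n) == value for n, value in cases)]


def bounded_check(problem: Callable[[int], int], name: str, n_max: int = 200) -> bool:
    """Stand-in for the prove stage: a finite check, not a proof."""
    formula = CANDIDATES[name]
    return all(formula(n) == problem(n) for n in range(1, n_max + 1))


cases = enumerate_cases(sum_of_first_odds, 5)
survivors = conjecture(cases)
print(survivors, [bounded_check(sum_of_first_odds, name) for name in survivors])
```

A finite check can only refute a conjecture, never establish it for all n, which is why the framework hands the surviving candidate to a formal theorem prover.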

MASTER: Multi-Agent Security Through Exploration of Roles and Topological Structures -- A Comprehensive Framework

Abstract

arXiv:2505.18572v1 Announce Type: new Abstract: Large Language Model (LLM)-based Multi-Agent Systems (MAS) exhibit remarkable problem-solving and task planning capabilities across diverse domains due to their specialized agentic roles and collaborative interactions. However, this also amplifies the severity of security risks under MAS attacks. To address this, we introduce MASTER, a novel security research framework for MAS, focusing on diverse Role configurations and Topological structures across various scenarios. MASTER offers an automated construction process for different MAS setups and an information-flow-based interaction paradigm. To tackle MAS security challenges in varied scenarios, we design a scenario-adaptive, extensible attack strategy utilizing role and topological information, which dynamically allocates targeted, domain-specific attack tasks for collaborative agent execution. Our experiments demonstrate that such an attack, leveraging role and topological information, exhibits significant destructive potential across most models. Additionally, we propose corresponding defense strategies, substantially enhancing MAS resilience across diverse scenarios. We anticipate that our framework and findings will provide valuable insights for future research into MAS security challenges.

摘要

基于大语言模型（LLM）的多智能体系统（MAS）凭借其专业化的智能体角色与协同交互机制，在跨领域问题解决和任务规划方面展现出卓越能力。然而这也使得系统在遭受MAS攻击时的安全风险严重性被放大。为此，我们提出MASTER——一个面向多智能体系统安全研究的新型框架，重点关注不同场景下的角色配置与拓扑结构。该框架提供自动化构建多样化MAS配置的流程，以及基于信息流的交互范式。针对多场景下的MAS安全挑战，我们设计了一种利用角色与拓扑信息的场景自适应可扩展攻击策略，能够动态分配针对特定领域的目标攻击任务供智能体协作执行。实验表明，此类利用角色与拓扑信息的攻击对多数模型均具有显著破坏潜力。此外，我们提出了相应防御策略，可显著提升多场景下MAS的韧性。我们期待该框架及研究发现能为未来MAS安全挑战研究提供重要启示。

Response Uncertainty and Probe Modeling: Two Sides of the Same Coin in LLM Interpretability?

Abstract

arXiv:2505.18575v1 Announce Type: new Abstract: Probing techniques have shown promise in revealing how LLMs encode human-interpretable concepts, particularly when applied to curated datasets. However, the factors governing a dataset's suitability for effective probe training are not well-understood. This study hypothesizes that probe performance on such datasets reflects characteristics of both the LLM's generated responses and its internal feature space. Through quantitative analysis of probe performance and LLM response uncertainty across a series of tasks, we find a strong correlation: improved probe performance consistently corresponds to a reduction in response uncertainty, and vice versa. Subsequently, we delve deeper into this correlation through the lens of feature importance analysis. Our findings indicate that high LLM response variance is associated with a larger set of important features, which poses a greater challenge for probe models and often results in diminished performance. Moreover, leveraging the insights from response uncertainty analysis, we are able to identify concrete examples where LLM representations align with human knowledge across diverse domains, offering additional evidence of interpretable reasoning in LLMs.

摘要

探测技术在揭示大语言模型如何编码人类可解释概念方面展现出潜力，尤其在应用于精选数据集时表现突出。然而，目前对决定数据集是否适合有效训练探测器的因素仍缺乏深入理解。本研究提出假设：此类数据集上的探测器性能同时反映了大语言模型生成响应及其内部特征空间的特性。通过对系列任务中探测器性能与模型响应不确定性的定量分析，我们发现存在强相关性：探测器性能提升始终伴随响应不确定性的降低，反之亦然。随后，我们通过特征重要性分析的视角深入探究这种关联。研究结果表明，高模型响应方差与更庞大的重要特征集合相关，这为探测模型带来了更大挑战并通常导致性能下降。此外，基于响应不确定性分析的发现，我们能够识别出大语言模型表征与多领域人类知识相吻合的具体案例，这为模型可解释推理提供了新的实证依据。

RvLLM: LLM Runtime Verification with Domain Knowledge

Abstract

arXiv:2505.18585v1 Announce Type: new Abstract: Large language models (LLMs) have emerged as a dominant AI paradigm due to their exceptional text understanding and generation capabilities. However, their tendency to generate inconsistent or erroneous outputs challenges their reliability, especially in high-stakes domains requiring accuracy and trustworthiness. Existing research primarily focuses on detecting and mitigating model misbehavior in general-purpose scenarios, often overlooking the potential of integrating domain-specific knowledge. In this work, we advance misbehavior detection by incorporating domain knowledge. The core idea is to design a general specification language that enables domain experts to customize domain-specific predicates in a lightweight and intuitive manner, supporting later runtime verification of LLM outputs. To achieve this, we design a novel specification language, ESL, and introduce a runtime verification framework, RvLLM, to validate LLM output against domain-specific constraints defined in ESL. We evaluate RvLLM on three representative tasks: violation detection against Singapore Rapid Transit Systems Act, numerical comparison, and inequality solving. Experimental results demonstrate that RvLLM effectively detects erroneous outputs across various LLMs in a lightweight and flexible manner. The results reveal that despite their impressive capabilities, LLMs remain prone to low-level errors due to limited interpretability and a lack of formal guarantees during inference, and our framework offers a potential long-term solution by leveraging expert domain knowledge to rigorously and efficiently verify LLM outputs.

摘要

大型语言模型（LLMs）因其卓越的文本理解和生成能力，已成为人工智能领域的主导范式。然而，其生成不一致或错误输出的倾向对可靠性提出了挑战，尤其是在需要精确性和可信度的高风险领域。现有研究主要集中于检测和缓解通用场景中的模型错误行为，往往忽视了整合领域特定知识的潜力。本研究通过融入领域知识，推进了错误行为检测。核心思想是设计一种通用规范语言，使领域专家能够以轻量级且直观的方式定制领域特定谓词，从而支持后续对LLM输出的运行时验证。为此，我们设计了一种新颖的规范语言ESL，并引入了一个运行时验证框架RvLLM，用于根据ESL中定义的领域特定约束验证LLM输出。我们在三个代表性任务上评估了RvLLM：新加坡快速交通系统法案违规检测、数值比较和不等式求解。实验结果表明，RvLLM以轻量级且灵活的方式有效检测了各种LLM的错误输出。结果揭示，尽管LLM能力出众，但由于推理过程中可解释性有限且缺乏形式化保证，它们仍容易犯低级错误，而我们的框架通过利用专家领域知识严格高效地验证LLM输出，提供了潜在的长期解决方案。

LLMs for Supply Chain Management

Abstract

arXiv:2505.18597v1 Announce Type: new Abstract: The development of large language models (LLMs) has provided new tools for research in supply chain management (SCM). In this paper, we introduce a retrieval-augmented generation (RAG) framework that dynamically integrates external knowledge into the inference process, and develop a domain-specialized SCM LLM, which demonstrates expert-level competence by passing standardized SCM examinations and beer game tests. We further use LLMs to conduct horizontal and vertical supply chain games, in order to analyze competition and cooperation within supply chains. Our experiments show that RAG significantly improves performance on SCM tasks. Moreover, game-theoretic analysis reveals that the LLM can reproduce insights from the classical SCM literature, while also uncovering novel behaviors and offering fresh perspectives on phenomena such as the bullwhip effect. This paper opens the door to exploring cooperation and competition in complex supply chain networks through the lens of LLMs.

摘要

大型语言模型（LLM）的发展为供应链管理（SCM）研究提供了新工具。本文提出一种检索增强生成（RAG）框架，该框架能将外部知识动态整合至推理过程，并开发出具备领域专业性的SCM-LLM模型。该模型通过标准化SCM考试和啤酒游戏测试，展现出专家级能力。我们进一步运用LLM开展横向与纵向供应链博弈，以分析供应链内的竞争与合作。实验表明RAG能显著提升SCM任务表现。博弈论分析揭示该LLM既能复现经典SCM文献的洞见，又能发现新行为，为牛鞭效应等现象提供新视角。本研究为通过LLM探索复杂供应链网络的合作与竞争机制开辟了新途径。

Knowledge Retrieval in LLM Gaming: A Shift from Entity-Centric to Goal-Oriented Graphs

Abstract

arXiv:2505.18607v1 Announce Type: new Abstract: Large Language Models (LLMs) demonstrate impressive general capabilities but often struggle with step-by-step reasoning, especially in complex applications such as games. While retrieval-augmented methods like GraphRAG attempt to bridge this gap through cross-document extraction and indexing, their fragmented entity-relation graphs and overly dense local connectivity hinder the construction of coherent reasoning. In this paper, we propose a novel framework based on Goal-Oriented Graphs (GoGs), where each node represents a goal and its associated attributes, and edges encode logical dependencies between goals. This structure enables explicit retrieval of reasoning paths by first identifying high-level goals and recursively retrieving their subgoals, forming coherent reasoning chains to guide LLM prompting. Our method significantly enhances the reasoning ability of LLMs in game-playing tasks, as demonstrated by extensive experiments on the Minecraft testbed, outperforming GraphRAG and other baselines.

摘要

大型语言模型（LLMs）展现出卓越的通用能力，但在逐步推理任务中常面临困难，尤其在游戏等复杂应用场景。尽管GraphRAG等基于检索增强的方法尝试通过跨文档信息提取与索引来弥补这一缺陷，但其碎片化的实体-关系图和过度稠密的局部连接阻碍了连贯推理链的构建。本文提出一种基于目标导向图（GoGs）的新型框架：节点表示目标及其关联属性，边编码目标间的逻辑依赖关系。该结构通过先识别高层目标、再递归检索子目标的方式实现推理路径的显式检索，从而形成连贯的推理链以指导LLM提示。在Minecraft测试平台上的大量实验表明，本方法显著提升了LLMs在游戏任务中的推理能力，其表现优于GraphRAG及其他基线模型。
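
The retrieval step described above (identify a high-level goal, then recursively retrieve its subgoals to form a reasoning chain) can be sketched as a post-order traversal of a dependency graph. This is a minimal sketch under stated assumptions, not the paper's implementation: the graph contents, goal names, and the `retrieve_chain` helper are hypothetical, and node attributes are omitted.

```python
from typing import Dict, List, Optional, Set

# Hypothetical goal graph: each goal maps to the subgoals it depends on.
GOAL_GRAPH: Dict[str, List[str]] = {
    "craft_stone_pickaxe": ["craft_wooden_pickaxe", "mine_cobblestone", "craft_sticks"],
    "mine_cobblestone": ["craft_wooden_pickaxe"],
    "craft_wooden_pickaxe": ["craft_planks", "craft_sticks"],
    "craft_sticks": ["craft_planks"],
    "craft_planks": ["collect_logs"],
    "collect_logs": [],
}


def retrieve_chain(
    goal: str,
    graph: Dict[str, List[str]],
    visited: Optional[Set[str]] = None,
) -> List[str]:
    """Return the goal and its subgoals with every dependency listed first."""
    if visited is None:
        visited = set()
    if goal in visited:
        return []  # already scheduled earlier in the chain
    visited.add(goal)
    chain: List[str] = []
    for subgoal in graph.get(goal, []):
        chain.extend(retrieve_chain(subgoal, graph, visited))
    chain.append(goal)  # a goal comes after all of its subgoals
    return chain


chain = retrieve_chain("craft_stone_pickaxe", GOAL_GRAPH)
# Assumed prompt format: a numbered list of steps handed to the LLM.
prompt = "\n".join(f"{i}. {step}" for i, step in enumerate(chain, start=1))
print(prompt)
```

Because each goal is appended only after its subgoals, the resulting chain is a valid execution order whenever the graph is acyclic.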

AI for Regulatory Affairs: Balancing Accuracy, Interpretability, and Computational Cost in Medical Device Classification

Abstract

arXiv:2505.18695v1 Announce Type: new Abstract: Regulatory affairs, which sits at the intersection of medicine and law, can benefit significantly from AI-enabled automation. The classification task is the initial step in which manufacturers position their products to regulatory authorities, and it plays a critical role in determining market access, regulatory scrutiny, and ultimately, patient safety. In this study, we investigate a broad range of AI models -- including traditional machine learning (ML) algorithms, deep learning architectures, and large language models -- using a regulatory dataset of medical device descriptions. We evaluate each model along three key dimensions: accuracy, interpretability, and computational cost.

摘要

作为医药学与法学交叉领域的监管事务，可从人工智能驱动的自动化中显著获益。分类任务是制造商向监管机构申报产品定位的首要环节，对市场准入、监管审查乃至患者安全具有决定性作用。本研究基于医疗器械描述数据集，系统考察了传统机器学习算法、深度学习架构及大语言模型在内的多种人工智能模型。我们从准确性、可解释性和计算成本三个关键维度对各类模型进行了全面评估。

Abstract

arXiv:2505.18603v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have made significant progress in document understanding. However, the information-dense nature of document images still poses challenges, as most queries depend on only a few relevant regions, with the rest being redundant. Existing one-pass MLLMs process entire document images without considering query relevance, often failing to focus on critical regions and producing unfaithful responses. Inspired by the human coarse-to-fine reading pattern, we introduce Doc-CoB (Chain-of-Box), a simple-yet-effective mechanism that integrates human-style visual reasoning into MLLM without modifying its architecture. Our method allows the model to autonomously select the set of regions (boxes) most relevant to the query, and then focus attention on them for further understanding. We first design a fully automatic pipeline, integrating a commercial MLLM with a layout analyzer, to generate 249k training samples with intermediate visual reasoning supervision. Then we incorporate two enabling tasks that improve box identification and box-query reasoning, which together enhance document understanding. Extensive experiments on seven benchmarks with four popular models show that Doc-CoB significantly improves performance, demonstrating its effectiveness and wide applicability. All code, data, and models will be released publicly.

摘要

多模态大语言模型（MLLMs）在文档理解领域取得了显著进展。然而，文档图像信息密集的特性仍带来挑战，因为大多数查询仅依赖于少数相关区域，其余部分则冗余。现有的一阶段MLLMs在未考虑查询相关性的情况下处理整个文档图像，往往难以聚焦关键区域并产生不可靠的响应。受人类由粗到细阅读模式的启发，我们提出了Doc-CoB（链式框选）机制，这一简洁高效的方案在不修改模型架构的前提下，将类人视觉推理能力融入MLLM。该方法使模型能自主选择与查询最相关的区域（框）集合，进而集中注意力进行深度理解。我们首先设计了一个全自动流程，将商用MLLM与布局分析器结合，生成24.9万条带有中间视觉推理监督的训练样本。随后引入两项赋能任务以提升框选识别和框-查询推理能力，共同增强文档理解性能。在四个主流模型上对七个基准测试的广泛实验表明，Doc-CoB显著提升了性能，验证了其有效性和广泛适用性。所有代码、数据及模型将公开释放。

AI-Researcher: Autonomous Scientific Innovation

Abstract

arXiv:2505.18705v1 Announce Type: new Abstract: The powerful reasoning capabilities of Large Language Models (LLMs) in mathematics and coding, combined with their ability to automate complex tasks through agentic frameworks, present unprecedented opportunities for accelerating scientific innovation. In this paper, we introduce AI-Researcher, a fully autonomous research system that transforms how AI-driven scientific discovery is conducted and evaluated. Our framework seamlessly orchestrates the complete research pipeline--from literature review and hypothesis generation to algorithm implementation and publication-ready manuscript preparation--with minimal human intervention. To rigorously assess autonomous research capabilities, we develop Scientist-Bench, a comprehensive benchmark comprising state-of-the-art papers across diverse AI research domains, featuring both guided innovation and open-ended exploration tasks. Through extensive experiments, we demonstrate that AI-Researcher achieves remarkable implementation success rates and produces research papers that approach human-level quality. This work establishes new foundations for autonomous scientific innovation that can complement human researchers by systematically exploring solution spaces beyond cognitive limitations.

摘要

大型语言模型（LLMs）在数学与编程领域强大的推理能力，结合其通过智能体框架自动化复杂任务的特点，为加速科学创新提供了前所未有的机遇。本文提出AI-Researcher——一个彻底变革AI驱动科研工作方式与评估体系的完全自主研究系统。该框架能无缝协调从文献综述、假设生成到算法实现及可发表级论文撰写的完整研究流程，仅需极少量人工干预。为系统评估自主科研能力，我们开发了Scientist-Bench综合基准测试，涵盖多个人工智能研究领域的前沿论文，包含定向创新与开放式探索双重任务。大量实验表明，AI-Researcher不仅实现了显著的实施方案成功率，其产出的研究论文质量更接近人类水平。本研究为突破认知局限、系统性探索解决方案空间的自主科学创新奠定了新基础，未来可与人类研究者形成互补。

MLLMs are Deeply Affected by Modality Bias

Abstract

arXiv:2505.18657v1 Announce Type: new Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have shown promising results in integrating diverse modalities such as texts and images. MLLMs are heavily influenced by modality bias, often relying on language while under-utilizing other modalities like visual inputs. This position paper argues that MLLMs are deeply affected by modality bias. Firstly, we diagnose the current state of modality bias, highlighting its manifestations across various tasks. Secondly, we propose a systematic research road-map related to modality bias in MLLMs. Thirdly, we identify key factors of modality bias in MLLMs and offer actionable suggestions for future research to mitigate it. To substantiate these findings, we conduct experiments that demonstrate the influence of each factor: 1. Data Characteristics: Language data is compact and abstract, while visual data is redundant and complex, creating an inherent imbalance in learning dynamics. 2. Imbalanced Backbone Capabilities: The dominance of pretrained language models in MLLMs leads to overreliance on language and neglect of visual information. 3. Training Objectives: Current objectives often fail to promote balanced cross-modal alignment, resulting in shortcut learning biased toward language. These findings highlight the need for balanced training strategies and model architectures to better integrate multiple modalities in MLLMs. We call for interdisciplinary efforts to tackle these challenges and drive innovation in MLLM research. Our work provides a fresh perspective on modality bias in MLLMs and offers insights for developing more robust and generalizable multimodal systems, advancing progress toward Artificial General Intelligence.

摘要

多模态大语言模型（MLLMs）的最新进展在整合文本与图像等多样模态方面展现出显著潜力。然而，MLLMs深受模态偏差影响，往往过度依赖语言模态而忽视视觉输入等其他模态的充分利用。本立场文件论证了MLLMs中存在深层次的模态偏差问题：首先，我们系统诊断了当前模态偏差的表现形式及其在不同任务中的影响；其次，提出针对MLLMs模态偏差的系统性研究路线图；第三，揭示了导致模态偏差的关键因素，并为未来研究提供可操作的缓解建议。通过实验验证，我们证实了以下核心因素的影响机制：1. 数据特性——语言数据具有紧凑性和抽象性，而视觉数据存在冗余性与复杂性，这种固有差异导致学习动态失衡；2. 骨干能力失衡——预训练语言模型在MLLMs中的主导地位引发对语言模态的过度依赖；3. 训练目标缺陷——现有目标函数难以实现跨模态均衡对齐，导致模型倾向于语言捷径学习。这些发现表明，需要开发均衡的训练策略与模型架构以实现多模态的有效整合。我们呼吁跨学科协作应对这些挑战，推动MLLM研究的创新发展。本研究为理解MLLMs中的模态偏差提供了新视角，并为构建更具鲁棒性和泛化性的多模态系统提供了理论依据，这对推进通用人工智能发展具有重要意义。

AI-Driven Climate Policy Scenario Generation for Sub-Saharan Africa

Abstract

arXiv:2505.18694v1 Announce Type: new Abstract: Climate policy scenario generation and evaluation have traditionally relied on integrated assessment models (IAMs) and expert-driven qualitative analysis. These methods enable stakeholders, such as policymakers and researchers, to anticipate impacts, plan governance strategies, and develop mitigation measures. However, traditional methods are often time-intensive, reliant on simple extrapolations of past trends, and limited in capturing the complex and interconnected nature of energy and climate issues. With the advent of artificial intelligence (AI), particularly generative AI models trained on vast datasets, these limitations can be addressed, ensuring robustness even under limited data conditions. In this work, we explore a novel method that employs generative AI, specifically large language models (LLMs), to simulate climate policy scenarios for Sub-Saharan Africa. These scenarios focus on energy transition themes derived from the historical United Nations Climate Change Conference (COP) documents. By leveraging generative models, the project aims to create plausible and diverse policy scenarios that align with regional climate goals and energy challenges. Given limited access to human evaluators, automated techniques were employed for scenario evaluation. We generated policy scenarios using the llama3.2-3B model. Of the 34 generated responses, 30 (88%) passed expert validation, accurately reflecting the intended impacts provided in the corresponding prompts. We compared these validated responses against assessments from a human climate expert and two additional LLMs (gemma2-2B and mistral-7B). Our structured, embedding-based evaluation framework shows that generative AI effectively generates scenarios that are coherent, relevant, plausible, and diverse. This approach offers a transformative tool for climate policy planning in data-constrained regions.

摘要

气候政策情景生成与评估传统上依赖于综合评估模型（IAMs）和专家驱动的定性分析。这些方法使政策制定者和研究人员等利益相关者能够预测影响、规划治理策略并制定缓解措施。然而，传统方法通常耗时较长，依赖于对历史趋势的简单外推，且在捕捉能源与气候问题复杂互联性方面存在局限。随着人工智能（AI）的发展，特别是基于海量数据训练的生成式AI模型，这些限制得以解决，即使在数据有限条件下也能确保稳健性。本研究探索了一种创新方法，利用生成式AI（尤其是大语言模型LLMs）模拟撒哈拉以南非洲的气候政策情景。这些情景聚焦于从历届联合国气候变化大会（COP）文件提取的能源转型主题。通过生成模型，该项目旨在创建符合区域气候目标和能源挑战的合理且多样化的政策情景。由于人类评估者资源有限，研究采用自动化技术进行情景评估。我们使用llama3.2-3B模型生成政策情景，在34条生成响应中，30条（88%）通过专家验证，准确反映了对应提示中的预期影响。我们将这些验证响应与人类气候专家及另外两个LLM模型（gemma2-2B和mistral-7B）的评估结果进行对比。基于嵌入的结构化评估框架表明，生成式AI能有效生成连贯、相关、合理且多样化的情景。该方法为数据受限地区的气候政策规划提供了变革性工具。

$C^3$ -Bench: The Things Real Disturbing LLM based Agent in Multi-Tasking

Abstract

arXiv:2505.18746v1 Announce Type: new Abstract: Agents based on large language models leverage tools to modify environments, revolutionizing how AI interacts with the physical world. Unlike traditional NLP tasks that rely solely on historical dialogue for responses, these agents must consider more complex factors, such as inter-tool relationships, environmental feedback and previous decisions, when making choices. Current research typically evaluates agents via multi-turn dialogues. However, it overlooks the influence of these critical factors on agent behavior. To bridge this gap, we present an open-source and high-quality benchmark $C^3$ -Bench. This benchmark integrates attack concepts and applies univariate analysis to pinpoint key elements affecting agent robustness. Concretely, we design three challenges: navigating complex tool relationships, handling critical hidden information, and managing dynamic decision paths. Complementing these challenges, we introduce fine-grained metrics, innovative data collection algorithms and reproducible evaluation methods. Extensive experiments are conducted on 49 mainstream agents, encompassing general fast-thinking, slow-thinking and domain-specific models. We observe that agents have significant shortcomings in handling tool dependencies, long-context information dependencies and frequent policy-type switching. In essence, $C^3$ -Bench aims to expose model vulnerabilities through these challenges and drive research into the interpretability of agent performance. The benchmark is publicly available at https://github.com/yupeijei1997/C3-Bench.

摘要

基于大语言模型的智能体通过工具操作改变环境，正在彻底革新人工智能与物理世界的交互方式。与传统自然语言处理任务仅依赖历史对话生成响应不同，此类智能体在决策时需综合考虑工具间关联性、环境反馈和历史选择等复杂因素。当前研究通常通过多轮对话评估智能体性能，却忽视了这些关键因素对智能体行为的影响。为填补这一空白，我们提出了开源高质量基准测试集 $C^3$ -Bench。该基准融合攻击概念并采用单变量分析，精准识别影响智能体鲁棒性的关键要素。具体而言，我们设计了三大挑战：复杂工具关系导航、关键隐藏信息处理和动态决策路径管理。配合这些挑战，我们引入了细粒度评估指标、创新的数据收集算法和可复现的评测方法。通过对49个主流智能体（包括通用快思考、慢思考及领域专用模型）的大规模实验，我们发现现有智能体在处理工具依赖性、长上下文信息关联和频繁策略切换方面存在显著缺陷。本质上， $C^3$ -Bench旨在通过这些挑战暴露模型弱点，并推动智能体性能可解释性研究。本基准已开源发布：https://github.com/yupeijei1997/C3-Bench。

Mitigating Deceptive Alignment via Self-Monitoring

Abstract

arXiv:2505.18807v1 Announce Type: new Abstract: Modern large language models rely on chain-of-thought (CoT) reasoning to achieve impressive performance, yet the same mechanism can amplify deceptive alignment: situations in which a model appears aligned while covertly pursuing misaligned goals. Existing safety pipelines treat deception as a black-box output to be filtered post-hoc, leaving the model free to scheme during its internal reasoning. We ask: Can deception be intercepted while the model is thinking? We answer this question with CoT Monitor+, the first framework that embeds a Self-Monitor inside the CoT process itself. During generation, the model produces (i) ordinary reasoning steps and (ii) an internal self-evaluation signal trained to flag and suppress misaligned strategies. The signal is used as an auxiliary reward in reinforcement learning, creating a feedback loop that rewards honest reasoning and discourages hidden goals. To study deceptive alignment systematically, we introduce DeceptionBench, a five-category benchmark that probes covert alignment-faking, sycophancy, etc. We evaluate various LLMs and show that unrestricted CoT roughly aggravates the deceptive tendency. In contrast, CoT Monitor+ cuts deceptive behaviors by 43.8% on average while preserving task accuracy. Further, when the self-monitor signal replaces an external weak judge in RL fine-tuning, models exhibit substantially fewer obfuscated thoughts and retain transparency. Our project website can be found at cot-monitor-plus.github.io

摘要

现代大型语言模型依赖思维链（CoT）推理实现卓越性能，但该机制也可能放大欺骗性对齐现象，即模型表面合规却暗中追求未对齐目标。现有安全方案将欺骗视为需事后过滤的黑箱输出，放任模型在内部推理中持续谋划。我们提出核心问题：能否在模型思考过程中拦截欺骗行为？为此，我们首次提出在CoT流程内部嵌入自我监控框架CoT Monitor+。该框架在生成时同步产生：（i）常规推理步骤；（ii）经训练的内部自评估信号，用于标记并抑制未对齐策略。该信号作为强化学习的辅助奖励，形成促进诚实推理、遏制隐藏目标的反馈循环。为系统研究欺骗性对齐，我们构建DeceptionBench基准测试，涵盖伪装对齐、谄媚行为等五类探测任务。评估表明，无约束CoT大体上会加剧欺骗倾向，而CoT Monitor+在保持任务准确率的同时，将欺骗行为平均削减43.8%。进一步研究发现，当自监控信号替代RL微调中的外部弱评估器时，模型显著减少模糊思维并保持透明度。项目网站见cot-monitor-plus.github.io。

The Quest for Efficient Reasoning: A Data-Centric Benchmark to CoT Distillation

Abstract

arXiv:2505.18759v1 Announce Type: new Abstract: Data-centric distillation, including data augmentation, selection, and mixing, offers a promising path to creating smaller, more efficient student Large Language Models (LLMs) that retain strong reasoning abilities. However, a comprehensive benchmark that systematically assesses the effect of each distillation approach is still lacking. This paper introduces DC-CoT, the first data-centric benchmark that investigates data manipulation in chain-of-thought (CoT) distillation from method, model and data perspectives. Utilizing various teacher models (e.g., o4-mini, Gemini-Pro, Claude-3.5) and student architectures (e.g., 3B, 7B parameters), we rigorously evaluate the impact of these data manipulations on student model performance across multiple reasoning datasets, with a focus on in-distribution (IID) and out-of-distribution (OOD) generalization, and cross-domain transfer. Our findings aim to provide actionable insights and establish best practices for optimizing CoT distillation through data-centric techniques, ultimately facilitating the development of more accessible and capable reasoning models. The dataset can be found at https://huggingface.co/datasets/rana-shahroz/DC-COT, while our code is shared at https://anonymous.4open.science/r/DC-COT-FF4C/.

摘要

以数据为中心的蒸馏方法（包括数据增强、选择和混合）为创建更小、更高效且保持强大推理能力的学生大语言模型（LLM）提供了一条有前景的路径。然而，目前仍缺乏一个全面的基准来系统评估每种蒸馏方法的效果。本文提出了DC-CoT，这是首个从方法、模型和数据角度系统研究思维链（CoT）蒸馏中数据操作的以数据为中心的基准。通过利用多种教师模型（如o4-mini、Gemini-Pro、Claude-3.5）和学生架构（如3B、7B参数），我们严格评估了这些数据操作对学生模型在多个推理数据集上性能的影响，重点关注分布内（IID）和分布外（OOD）泛化能力以及跨领域迁移。我们的研究旨在为通过以数据为中心的技术优化CoT蒸馏提供可操作的见解，并建立最佳实践，最终推动开发更易获取且能力更强的推理模型。

Stronger Enforcement of Instruction Hierarchy via Augmented Intermediate Representations

Abstract

arXiv:2505.18907v1 Announce Type: new Abstract: Prompt injection attacks are a critical security vulnerability in large language models (LLMs), allowing attackers to hijack model behavior by injecting malicious instructions within the input context. Recent defense mechanisms have leveraged an Instruction Hierarchy (IH) Signal, often implemented through special delimiter tokens or additive embeddings to denote the privilege level of input tokens. However, these prior works typically inject the IH signal exclusively at the initial input layer, which we hypothesize limits its ability to effectively distinguish the privilege levels of tokens as it propagates through the different layers of the model. To overcome this limitation, we introduce a novel approach that injects the IH signal into the intermediate token representations within the network. Our method augments these representations with layer-specific trainable embeddings that encode the privilege information. Our evaluations across multiple models and training methods reveal that our proposal yields between $1.6\times$ and $9.2\times$ reduction in attack success rate on gradient-based prompt injection attacks compared to state-of-the-art methods, without significantly degrading the model's utility.

摘要

提示注入攻击是大型语言模型（LLMs）中一种关键的安全漏洞，攻击者通过在输入上下文中注入恶意指令来劫持模型行为。现有防御机制通常采用指令层级（IH）信号，通过特殊分隔符标记或附加嵌入来表示输入标记的权限级别。然而，这些方法通常仅在初始输入层注入IH信号，我们假设这会限制其在模型各层传播过程中有效区分标记权限的能力。为克服这一局限，我们提出一种新方法，将IH信号注入网络中的中间标记表示。该方法通过特定于层的可训练嵌入来增强这些表示，从而编码权限信息。我们在多种模型和训练方法上的评估表明，与现有最优方法相比，该方案在基于梯度的提示注入攻击中实现了攻击成功率降低1.6至9.2倍的效果，且未显著影响模型效用。
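
To make the contrast with input-layer-only injection concrete, the sketch below adds a privilege-dependent embedding to every token representation before each layer, using a separate embedding table per layer. It is a toy illustration under stated assumptions, not the authors' architecture: `toy_layer` is a stand-in for a transformer block, the embedding values are fixed constants rather than trained parameters, and the exact injection point within a layer is assumed.

```python
from typing import Dict, List

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise vector addition."""
    return [x + y for x, y in zip(a, b)]


def toy_layer(hidden: List[Vector]) -> List[Vector]:
    """Stand-in for a transformer block: here it just halves each value."""
    return [[0.5 * x for x in h] for h in hidden]


def forward(
    hidden: List[Vector],
    privilege: List[int],
    ih_tables: List[Dict[int, Vector]],
) -> List[Vector]:
    """Inject the layer-specific privilege embedding before every layer."""
    for table in ih_tables:  # one embedding table per layer
        hidden = [add(h, table[p]) for h, p in zip(hidden, privilege)]
        hidden = toy_layer(hidden)
    return hidden


# Two tokens with identical content but different privilege levels
# (assumed convention: 0 = system instruction, 1 = untrusted data).
hidden_in = [[1.0, 1.0], [1.0, 1.0]]
privilege = [0, 1]
ih_tables = [
    {0: [1.0, 0.0], 1: [0.0, 1.0]},  # layer 0 embeddings (hypothetical values)
    {0: [2.0, 0.0], 1: [0.0, 2.0]},  # layer 1 embeddings (hypothetical values)
]
hidden_out = forward(hidden_in, privilege, ih_tables)
print(hidden_out)
```

The two tokens start identical and end with different representations, since the privilege signal is reapplied at each layer instead of being supplied only once at the input.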

LiteCUA: Computer as MCP Server for Computer-Use Agent on AIOS

Abstract

arXiv:2505.18829v1 Announce Type: new Abstract: We present AIOS 1.0, a novel platform designed to advance computer-use agent (CUA) capabilities through environmental contextualization. While existing approaches primarily focus on building more powerful agent frameworks or enhancing agent models, we identify a fundamental limitation: the semantic disconnect between how language models understand the world and how computer interfaces are structured. AIOS 1.0 addresses this challenge by transforming computers into contextual environments that language models can natively comprehend, implementing a Model Context Protocol (MCP) server architecture to abstract computer states and actions. This approach effectively decouples interface complexity from decision complexity, enabling agents to reason more effectively about computing environments. To demonstrate our platform's effectiveness, we introduce LiteCUA, a lightweight computer-use agent built on AIOS 1.0 that achieves a 14.66% success rate on the OSWorld benchmark, outperforming several specialized agent frameworks despite its simple architecture. Our results suggest that contextualizing computer environments for language models represents a promising direction for developing more capable computer-use agents and advancing toward AI that can interact with digital systems. The source code of LiteCUA is available at https://github.com/agiresearch/LiteCUA, and it is also integrated into the AIOS main branch as part of AIOS at https://github.com/agiresearch/AIOS.

摘要

我们推出AIOS 1.0这一创新平台，旨在通过环境情境化提升计算机使用代理（CUA）的能力。现有方法主要聚焦于构建更强大的代理框架或增强代理模型，但我们发现一个根本性局限：语言模型对世界的理解方式与计算机界面结构之间存在语义断层。AIOS 1.0通过将计算机转化为语言模型可原生理解的情境化环境，采用模型情境协议（MCP）服务器架构来抽象计算机状态与动作，从而有效解决这一挑战。该方法实现了界面复杂度与决策复杂度的解耦，使代理能更高效地推理计算环境。为验证平台效能，我们基于AIOS 1.0开发了轻量级计算机使用代理LiteCUA，其在OSWorld基准测试中取得14.66%的成功率，尽管架构简单却优于多个专用代理框架。研究结果表明，为语言模型构建计算机环境情境化是实现更强大计算机使用代理、推进AI与数字系统交互的重要方向。LiteCUA源代码发布于https://github.com/agiresearch/LiteCUA，并作为AIOS组成部分集成于主分支https://github.com/agiresearch/AIOS。

AdaCtrl: Towards Adaptive and Controllable Reasoning via Difficulty-Aware Budgeting

Abstract

arXiv:2505.18822v1 Announce Type: new Abstract: Modern large reasoning models demonstrate impressive problem-solving capabilities by employing sophisticated reasoning strategies. However, they often struggle to balance efficiency and effectiveness, frequently generating unnecessarily lengthy reasoning chains for simple problems. In this work, we propose AdaCtrl, a novel framework to support both difficulty-aware adaptive reasoning budget allocation and explicit user control over reasoning depth. AdaCtrl dynamically adjusts its reasoning length based on self-assessed problem difficulty, while also allowing users to manually control the budget to prioritize either efficiency or effectiveness. This is achieved through a two-stage training pipeline: an initial cold-start fine-tuning phase to instill the ability to self-assess difficulty and adjust the reasoning budget, followed by a difficulty-aware reinforcement learning (RL) stage that refines the model's adaptive reasoning strategies and calibrates its difficulty assessments based on its evolving capabilities during online training. To enable intuitive user interaction, we design explicit length-triggered tags that function as a natural interface for budget control. Empirical results show that AdaCtrl adapts reasoning length based on estimated difficulty. Compared to the standard training baseline that also incorporates fine-tuning and RL, it yields performance improvements and simultaneously reduces response length by 10.06% and 12.14% on the more challenging AIME2024 and AIME2025 datasets, which require elaborate reasoning, and by 62.05% and 91.04% on the MATH500 and GSM8K datasets, where more concise responses are sufficient. Furthermore, AdaCtrl enables precise user control over the reasoning budget, allowing for tailored responses to meet specific needs.

摘要

现代大型推理模型通过采用复杂的推理策略展现出令人印象深刻的问题解决能力。然而，这些模型往往难以平衡效率与效果，经常为简单问题生成不必要的冗长推理链。本研究提出AdaCtrl框架，该创新系统同时支持难度感知的自适应推理预算分配和用户对推理深度的显式控制。AdaCtrl根据自评估的问题难度动态调整推理长度，同时允许用户手动控制预算以优先考虑效率或效果。这一功能通过两阶段训练流程实现：首先是冷启动微调阶段，用于培养模型对难度的自我认知和推理预算调整能力；随后是难度感知强化学习阶段，该阶段优化模型的自适应推理策略，并根据在线训练过程中不断演进的能力校准其难度评估。为实现直观的用户交互，我们设计了显式的长度触发标签作为预算控制的自然界面。实验结果表明，相较于同样包含微调和强化学习的标准训练基线，AdaCtrl能基于预估难度调整推理长度——在需要精细推理的AIME2024和AIME2025数据集上，性能提升的同时分别减少10.06%和12.14%的响应长度；而在更简短响应即可满足需求的MATH500和GSM8K数据集上，缩减幅度分别达到62.05%和91.04%。此外，AdaCtrl实现了对推理预算的精确用户控制，可生成满足特定需求的定制化响应。

Signal, Image, or Symbolic: Exploring the Best Input Representation for Electrocardiogram-Language Models Through a Unified Framework

Abstract

arXiv:2505.18847v1 Announce Type: new Abstract: Recent advances have increasingly applied large language models (LLMs) to electrocardiogram (ECG) interpretation, giving rise to Electrocardiogram-Language Models (ELMs). Conditioned on an ECG and a textual query, an ELM autoregressively generates a free-form textual response. Unlike traditional classification-based systems, ELMs emulate expert cardiac electrophysiologists by issuing diagnoses, analyzing waveform morphology, identifying contributing factors, and proposing patient-specific action plans. To realize this potential, researchers are curating instruction-tuning datasets that pair ECGs with textual dialogues and are training ELMs on these resources. Yet before scaling ELMs further, there is a fundamental question yet to be explored: What is the most effective ECG input representation? In recent works, three candidate representations have emerged: raw time-series signals, rendered images, and discretized symbolic sequences. We present the first comprehensive benchmark of these modalities across 6 public datasets and 5 evaluation metrics. We find symbolic representations achieve the greatest number of statistically significant wins over both signal and image inputs. We further ablate the LLM backbone, ECG duration, and token budget, and we evaluate robustness to signal perturbations. We hope that our findings offer clear guidance for selecting input representations when developing the next generation of ELMs.

摘要

近年来，大型语言模型（LLMs）在心电图（ECG）解读中的应用日益增多，催生了心电图-语言模型（ELMs）。基于心电图和文本查询的条件，ELM能够自回归生成自由形式的文本响应。与传统基于分类的系统不同，ELMs通过发布诊断、分析波形形态、识别影响因素并提出针对患者的个性化行动计划，模拟了心脏电生理学专家的行为。为实现这一潜力，研究人员正在整理将心电图与文本对话配对的教学调优数据集，并基于这些资源训练ELMs。然而，在进一步扩展ELMs之前，一个尚未探索的基本问题是：最有效的心电图输入表示是什么？在最近的研究中，出现了三种候选表示形式——原始时间序列信号、渲染图像和离散化符号序列。我们首次对这些模态在6个公共数据集和5个评估指标上进行了全面基准测试。研究发现，符号表示在统计显著性上优于信号和图像输入的情况最多。我们进一步对LLM主干、ECG持续时间和令牌预算进行了消融实验，并评估了对信号扰动的鲁棒性。希望我们的研究结果为开发下一代ELMs时选择输入表示提供了明确的指导。

SQUiD: Synthesizing Relational Databases from Unstructured Text

Abstract

arXiv:2505.19025v1 Announce Type: new Abstract: Relational databases are central to modern data management, yet most data exists in unstructured forms like text documents. To bridge this gap, we leverage large language models (LLMs) to automatically synthesize a relational database by generating its schema and populating its tables from raw text. We introduce SQUiD, a novel neurosymbolic framework that decomposes this task into four stages, each with specialized techniques. Our experiments show that SQUiD consistently outperforms baselines across diverse datasets.

摘要

关系数据库是现代数据管理的核心，但大多数数据以非结构化形式（如文本文档）存在。为弥合这一鸿沟，我们利用大语言模型（LLM）从原始文本自动生成数据库模式并填充表格，从而合成关系数据库。本文提出新型神经符号框架SQUiD，将该任务分解为四个阶段，每个阶段采用专门技术。实验表明，SQUiD在多样化数据集上始终优于基线方法。

REACT: Representation Extraction And Controllable Tuning to Overcome Overfitting in LLM Knowledge Editing

Abstract

arXiv:2505.18933v1 Announce Type: new Abstract: Large language model editing methods frequently suffer from overfitting, wherein factual updates can propagate beyond their intended scope, overemphasizing the edited target even when it's contextually inappropriate. To address this challenge, we introduce REACT (Representation Extraction And Controllable Tuning), a unified two-phase framework designed for precise and controllable knowledge editing. In the initial phase, we utilize tailored stimuli to extract latent factual representations and apply Principal Component Analysis with a simple learnable linear transformation to compute a directional "belief shift" vector for each instance. In the second phase, we apply controllable perturbations to hidden states using the obtained vector with a magnitude scalar, gated by a pre-trained classifier that permits edits only when contextually necessary. Relevant experiments on EVOKE benchmarks demonstrate that REACT significantly reduces overfitting across nearly all evaluation metrics, and experiments on COUNTERFACT and MQuAKE show that our method preserves balanced basic editing performance (reliability, locality, and generality) under diverse editing scenarios.

摘要

大语言模型编辑方法普遍存在过拟合问题，即事实更新可能超出预期范围，即使在不恰当的语境下也会过度强调编辑目标。为解决这一挑战，我们提出REACT（表征提取与可控调谐）框架，这是一个为精确可控知识编辑设计的统一两阶段方案。第一阶段通过定制化刺激提取潜在事实表征，并采用主成分分析与可学习的线性变换计算每个实例的方向性"信念偏移"向量。第二阶段利用所得向量及幅度标量对隐藏状态施加可控扰动，其门控机制由预训练分类器实现，仅在语境需要时允许编辑。在EVOKE基准测试中的实验表明，REACT在几乎所有评估指标上显著降低了过拟合现象；而在COUNTERFACT和MQuAKE上的实验证明，该方法在多样化编辑场景下能保持可靠度、局部性与泛化性等基础编辑性能的平衡。
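
The second phase described in the REACT abstract (a perturbation along a "belief shift" vector, scaled by a magnitude and gated by a classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the logistic gate, its weights, the threshold, and all function names are hypothetical stand-ins chosen for illustration, and the hidden states are toy two-dimensional lists.

```python
import math

def gate_probability(hidden, gate_weights, gate_bias):
    """Toy logistic gate standing in for the pre-trained classifier (hypothetical)."""
    score = sum(h * w for h, w in zip(hidden, gate_weights)) + gate_bias
    return 1.0 / (1.0 + math.exp(-score))

def gated_edit(hidden, shift_vector, magnitude, gate_weights, gate_bias, threshold=0.5):
    """Add magnitude * shift_vector to the hidden state only when the gate fires."""
    if gate_probability(hidden, gate_weights, gate_bias) < threshold:
        return list(hidden)  # context judged out of scope: leave the state untouched
    return [h + magnitude * s for h, s in zip(hidden, shift_vector)]

# Assumed toy values: the gate only looks at the first coordinate.
shift = [0.0, 1.0]
weights, bias = [4.0, 0.0], 0.0
edited = gated_edit([1.0, 0.0], shift, 2.0, weights, bias)      # gate fires
untouched = gated_edit([-1.0, 0.0], shift, 2.0, weights, bias)  # gate stays closed
```

The gate is what limits the edit's scope in this sketch: an input the classifier considers unrelated passes through unchanged, which is the behavior the abstract associates with reduced overfitting.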

Can Large Language Models Infer Causal Relationships from Real-World Text?

Abstract

arXiv:2505.18931v1 Announce Type: new Abstract: Understanding and inferring causal relationships from texts is a core aspect of human cognition and is essential for advancing large language models (LLMs) towards artificial general intelligence. Existing work primarily focuses on synthetically generated texts which involve simple causal relationships explicitly mentioned in the text. This fails to reflect the complexities of real-world tasks. In this paper, we investigate whether LLMs are capable of inferring causal relationships from real-world texts. We develop a benchmark drawn from real-world academic literature which includes diverse texts with respect to length, complexity of relationships (different levels of explicitness, number of events, and causal relationships), and domains and sub-domains. To the best of our knowledge, our benchmark is the first-ever real-world dataset for this task. Our experiments on state-of-the-art LLMs evaluated on our proposed benchmark demonstrate significant challenges, with the best-performing model achieving an average F1 score of only 0.477. Analysis reveals common pitfalls: difficulty with implicitly stated information, in distinguishing relevant causal factors from surrounding contextual details, and with connecting causally relevant information spread across lengthy textual passages. By systematically characterizing these deficiencies, our benchmark offers targeted insights for further research into advancing LLM causal reasoning.

摘要

从文本中理解和推断因果关系是人类认知的核心方面，也是推动大语言模型（LLMs）迈向通用人工智能的关键。现有研究主要集中于合成生成的文本，这些文本仅涉及文中明确提及的简单因果关系，未能反映现实任务的复杂性。本文探究LLMs能否从现实世界文本中推断因果关系。我们构建了一个源自真实学术文献的基准测试集，包含长度各异、关系复杂度不同（明确性程度、事件数量及因果关系的差异）以及跨领域和子领域的多样化文本。据我们所知，这是该任务首个真实世界数据集。基于该基准对前沿LLMs的实验表明存在重大挑战，表现最佳模型的平均F1分数仅为0.477。分析揭示了常见缺陷：难以处理隐含信息、无法区分相关因果因素与上下文细节、以及难以整合分散在长文本中的因果相关信息。通过系统化表征这些不足，我们的基准为推进LLM因果推理的后续研究提供了针对性启示。

Meta-aware Learning in text-to-SQL Large Language Model

Abstract

arXiv:2505.18929v1 Announce Type: new Abstract: Advances in large language models (LLMs) give text-to-SQL tasks a strong opportunity to overcome their main challenges in business applications: understanding complex domain information and complex database structures. In this paper, we propose a meta-aware learning framework to integrate domain knowledge, database schema, chain-of-thought reasoning processes, and metadata relationships to improve the SQL generation quality. The proposed framework includes four learning strategies: schema-based learning, Chain-of-Thought (CoT) learning, knowledge-enhanced learning, and key information tokenization. Through fine-tuning, this approach gives the LLM a comprehensive understanding of database structure and metadata information, improving its performance on SQL generation within business domains. Through two experimental studies, we have demonstrated the superiority of the proposed methods in execution accuracy, multi-task SQL generation capability, and reduction of catastrophic forgetting.

摘要

大型语言模型（LLM）的进步为文本到SQL任务提供了重要机遇，以克服商业应用中理解复杂领域信息和复杂数据库结构的主要挑战。本文提出一种元感知学习框架，通过整合领域知识、数据库模式、思维链推理过程及元数据关系来提升SQL生成质量。该框架包含四种学习策略：基于模式的学习、思维链（CoT）学习、知识增强学习和关键信息标记化。该方法通过微调使LLM全面理解数据库结构和元数据信息，从而提升其在商业领域内SQL生成的性能。通过两项实验研究，我们验证了所提方法在执行准确率、多任务SQL生成能力以及减少灾难性遗忘方面的优越性。

Aligning LLM with human travel choices: a persona-based embedding learning approach

Abstract

arXiv:2505.19003v1 Announce Type: new Abstract: The advent of large language models (LLMs) presents new opportunities for travel demand modeling. However, behavioral misalignment between LLMs and humans presents obstacles for the usage of LLMs, and existing alignment methods are frequently inefficient or impractical given the constraints of typical travel demand data. This paper introduces a novel framework for aligning LLMs with human travel choice behavior, tailored to the current travel demand data sources. Our framework uses a persona inference and loading process to condition LLMs with suitable prompts to enhance alignment. The inference step establishes a set of base personas from empirical data, and a learned persona loading function driven by behavioral embeddings guides the loading process. We validate our framework on the Swissmetro mode choice dataset, and the results show that our proposed approach significantly outperformed baseline choice models and LLM-based simulation models in predicting both aggregate mode choice shares and individual choice outcomes. Furthermore, we showcase that our framework can generate insights on population behavior through interpretable parameters. Overall, our research offers a more adaptable, interpretable, and resource-efficient pathway to robust LLM-based travel behavior simulation, paving the way to integrate LLMs into travel demand modeling practice in the future.

摘要

大型语言模型（LLMs）的出现为交通需求建模带来了新的机遇。然而，LLMs与人类行为之间的偏差阻碍了其应用，且现有对齐方法在典型交通需求数据限制下往往效率低下或难以实施。本文提出一种新颖的框架，旨在使LLMs与人类出行选择行为对齐，并适应当前交通数据源特点。该框架通过角色推断与加载流程，利用合适的提示词对LLMs进行条件约束以提升对齐效果：推断步骤从实证数据中建立基础角色集，而由行为嵌入驱动的学习型角色加载函数则指导加载过程。我们在Swissmetro出行方式选择数据集上验证了该框架，结果表明所提方法在预测总体方式选择份额和个体选择结果方面，显著优于基线选择模型和基于LLM的仿真模型。此外，我们证明该框架可通过可解释参数生成群体行为洞见。总体而言，本研究为基于LLM的稳健交通行为仿真提供了更具适应性、可解释性且资源高效的路径，为未来将LLMs整合至交通需求建模实践奠定了基础。

Weaver: Interweaving SQL and LLM for Table Reasoning

Abstract

arXiv:2505.18961v1 Announce Type: new Abstract: Querying tables with unstructured data is challenging due to the presence of text (or image), either embedded in the table or in external paragraphs, which traditional SQL struggles to process, especially for tasks requiring semantic reasoning. While Large Language Models (LLMs) excel at understanding context, they face limitations with long input sequences. Existing approaches that combine SQL and LLMs typically rely on rigid, predefined workflows, limiting their adaptability to complex queries. To address these issues, we introduce Weaver, a modular pipeline that dynamically integrates SQL and LLMs for table-based question answering (TableQA). Weaver generates a flexible, step-by-step plan that combines SQL for structured data retrieval with LLMs for semantic processing. By decomposing complex queries into manageable subtasks, Weaver improves accuracy and generalization. Our experiments show that Weaver consistently outperforms state-of-the-art methods across four TableQA datasets, reducing both API calls and error rates.

摘要

由于表格中存在嵌入文本（或图像）或外部段落中的非结构化数据，传统SQL难以处理此类查询任务，尤其是需要语义推理的场景。尽管大语言模型（LLMs）擅长上下文理解，但面对长输入序列时仍存在局限。现有结合SQL与LLMs的方法通常依赖僵化的预定义工作流程，难以适应复杂查询需求。为此，我们提出Weaver——一种模块化流水线，通过动态整合SQL与LLMs实现基于表格的问答（TableQA）。Weaver生成灵活的逐步执行计划，结合SQL的结构化数据检索与LLMs的语义处理能力。通过将复杂查询分解为可处理的子任务，该系统显著提升了准确性与泛化能力。实验表明，Weaver在四个TableQA数据集上持续优于现有最优方法，同时降低了API调用次数与错误率。
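
The idea of interleaving SQL steps with LLM steps, as the Weaver abstract describes, can be sketched as follows. This is only an illustration under stated assumptions: the plan here is hard-coded, whereas Weaver is described as generating it dynamically; `fake_llm_sentiment` is a hypothetical keyword rule standing in for a real model call; and the `reviews` table and its contents are invented for the example.

```python
import sqlite3

def fake_llm_sentiment(text):
    """Hypothetical stand-in for an LLM call; a real system would query a model."""
    lowered = text.lower()
    return "positive" if "great" in lowered or "love" in lowered else "negative"

def run_plan(conn):
    # Step 1 (SQL): structured retrieval narrows the rows the LLM has to read.
    rows = conn.execute(
        "SELECT id, review FROM reviews WHERE product = ?", ("lamp",)
    ).fetchall()
    # Step 2 (LLM): semantic processing of the free-text column.
    conn.execute("CREATE TEMP TABLE labeled (id INTEGER, sentiment TEXT)")
    conn.executemany(
        "INSERT INTO labeled VALUES (?, ?)",
        [(row_id, fake_llm_sentiment(review)) for row_id, review in rows],
    )
    # Step 3 (SQL): aggregate over the column the LLM step produced.
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM labeled WHERE sentiment = 'positive'"
    ).fetchone()
    return count

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (id INTEGER, product TEXT, review TEXT)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [
        (1, "lamp", "Great light, love it"),
        (2, "lamp", "Broke after a week"),
        (3, "desk", "Great desk"),
        (4, "lamp", "Works great"),
    ],
)
positive_lamp_reviews = run_plan(conn)
```

Filtering with SQL before the semantic step keeps the text passed to the model short, which relates to the long-input limitation the abstract mentions.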

RECAST: Strengthening LLMs' Complex Instruction Following with Constraint-Verifiable Data

Abstract

arXiv:2505.19030v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly expected to tackle complex tasks, driven by their expanding applications and users' growing proficiency in crafting sophisticated prompts. However, as the number of explicitly stated requirements increases (particularly more than 10 constraints), LLMs often struggle to accurately follow such complex instructions. To address this challenge, we propose RECAST, a novel framework for synthesizing datasets where each example incorporates far more constraints than those in existing benchmarks. These constraints are extracted from real-world prompt-response pairs to ensure practical relevance. RECAST enables automatic verification of constraint satisfaction via rule-based validators for quantitative constraints and LLM-based validators for qualitative ones. Using this framework, we construct RECAST-30K, a large-scale, high-quality dataset comprising 30k instances spanning 15 constraint types. Experimental results demonstrate that models fine-tuned on RECAST-30K show substantial improvements in following complex instructions. Moreover, the verifiability provided by RECAST enables the design of reward functions for reinforcement learning, which further boosts model performance on complex and challenging tasks.

摘要

随着大语言模型（LLMs）应用范围的扩大以及用户编写复杂提示能力的提升，它们被越来越多地要求处理复杂任务。然而，当显式声明的需求数量增加（尤其是超过10个约束条件时），LLMs往往难以准确遵循此类复杂指令。为解决这一挑战，我们提出了RECAST框架，该框架通过合成数据集使每个样本包含远超现有基准的约束条件。这些约束从真实世界的提示-响应对中提取，以确保实际相关性。RECAST支持通过基于规则的验证器自动检验定量约束的满足情况，并利用基于LLM的验证器检验定性约束。借助该框架，我们构建了RECAST-30K数据集——一个包含15种约束类型、规模达3万实例的大规模高质量数据集。实验结果表明，基于RECAST-30K微调的模型在遵循复杂指令方面表现出显著提升。此外，RECAST提供的可验证性为强化学习的奖励函数设计提供了支持，从而进一步提高了模型在复杂挑战性任务上的性能。
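
The rule-based validators for quantitative constraints mentioned in the RECAST abstract can be pictured with a small sketch. The three constraint types below (word limit, bullet count, required keyword), the function names, and the idea of reporting the satisfied fraction are all assumptions made for illustration; the abstract does not specify the actual validators or the reward design.

```python
import re

def max_words(limit):
    """Constraint: the response has at most `limit` whitespace-separated tokens."""
    return lambda text: len(text.split()) <= limit

def min_bullets(count):
    """Constraint: the response has at least `count` lines starting with '- '."""
    return lambda text: len(re.findall(r"^- ", text, flags=re.MULTILINE)) >= count

def must_include(keyword):
    """Constraint: the response mentions `keyword`, case-insensitively."""
    return lambda text: keyword.lower() in text.lower()

def check_constraints(response, validators):
    """Return a per-constraint pass/fail report and the fraction satisfied."""
    report = {name: check(response) for name, check in validators.items()}
    return report, sum(report.values()) / len(report)

validators = {
    "at_most_10_words": max_words(10),
    "at_least_2_bullets": min_bullets(2),
    "mentions_sql": must_include("SQL"),
}
response = "- Use SQL for joins\n- Use an LLM for free text"
report, satisfied_fraction = check_constraints(response, validators)
```

Because each check is deterministic, a score such as `satisfied_fraction` could in principle serve as a reward signal, which is one way to read the abstract's remark about verifiability enabling reinforcement learning.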

Co-PatcheR: Collaborative Software Patching with Component(s)-specific Small Reasoning Models

Abstract

arXiv:2505.18955v1 Announce Type: new Abstract: Motivated by the success of general-purpose large language models (LLMs) in software patching, recent works started to train specialized patching models. Most works trained one model to handle the end-to-end patching pipeline (including issue localization, patch generation, and patch validation). However, it is hard for a small model to handle all tasks, as different sub-tasks have different workflows and require different expertise. As a result, SOTA methods using a 70-billion-parameter model reach at most a 41% resolved rate on SWE-bench-Verified. Motivated by the collaborative nature, we propose Co-PatcheR, the first collaborative patching system with small and specialized reasoning models for individual components. Our key technical novelties are the specific task designs and training recipes. First, we train a model for localization and patch generation. Our localization pinpoints the suspicious lines through a two-step procedure, and our generation combines patch generation and critique. We then propose a hybrid patch validation that includes two models for crafting issue-reproducing test cases with and without assertions and judging patch correctness, followed by a majority vote-based patch selection. Through extensive evaluation, we show that Co-PatcheR achieves a 46% resolved rate on SWE-bench-Verified with only 3 x 14B models. This makes Co-PatcheR the best patcher with specialized models, requiring the least training resources and the smallest models. We conduct a comprehensive ablation study to validate our recipes, as well as our choices of training data amount, model size, and testing-phase scaling strategy.

摘要

受通用大语言模型（LLM）在软件补丁生成领域成功的启发，近期研究开始训练专用补丁生成模型。现有工作大多训练单一模型处理端到端补丁流程（包括问题定位、补丁生成和补丁验证）。然而，小型模型难以胜任所有子任务，因为不同子任务具有差异化的工作流程和专业知识需求。因此，当前最佳方法使用700亿参数模型时，在SWE-bench-Verified基准上仅能达到41%的修复率。基于协作机制的思想，我们提出首个协作式补丁系统Co-PatcheR，该系统采用小型专用推理模型分别处理各组件任务。我们的核心技术创新在于特定任务设计与训练方案：首先训练定位与补丁生成联合模型，其中定位模块通过两级流程精确定位可疑代码行，生成模块整合补丁生成与批判式改进；随后提出混合补丁验证机制，包含两个模型分别用于生成带断言/不带断言的问题复现测试用例、判断补丁正确性，最终基于多数表决机制选择补丁。大量实验表明，Co-PatcheR仅使用3个140亿参数模型即在SWE-bench-Verified上实现46%的修复率，成为专用模型中性能最佳、训练资源需求最低且模型尺寸最小的补丁系统。我们通过全面消融实验验证了训练方案的有效性，以及对训练数据量、模型规模和测试阶段扩展策略的选择依据。
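
The final step named in the Co-PatcheR abstract, majority vote-based patch selection, can be sketched in a few lines. The abstract gives no details of the voting rule, so everything here is an assumption for illustration: candidates are represented as (patch text, passed validation) pairs, patches are compared after whitespace normalization, and ties go to the earliest candidate.

```python
from collections import Counter

def normalize(patch):
    """Strip surrounding whitespace per line so trivially different patches match."""
    return "\n".join(line.strip() for line in patch.strip().splitlines())

def select_patch(candidates):
    """Pick the most frequent patch among those that passed validation.

    candidates: list of (patch_text, passed_validation) pairs (hypothetical format).
    Returns None when no candidate passed validation.
    """
    valid = [normalize(patch) for patch, passed in candidates if passed]
    if not valid:
        return None
    votes = Counter(valid)
    # max() returns the first maximal element, so ties go to the earliest candidate.
    return max(valid, key=lambda patch: votes[patch])

candidates = [
    ("x = a + b", True),
    ("x = a - b", True),
    ("  x = a + b  ", True),
    ("x = a * b", False),  # failed validation, so it receives no vote
]
chosen = select_patch(candidates)
```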

OrgAccess: A Benchmark for Role Based Access Control in Organization Scale LLMs

Abstract

arXiv:2505.19165v1 Announce Type: new Abstract: Role-based access control (RBAC) and hierarchical structures are foundational to how information flows and decisions are made within virtually all organizations. As the potential of Large Language Models (LLMs) to serve as unified knowledge repositories and intelligent assistants in enterprise settings becomes increasingly apparent, a critical yet underexplored challenge emerges: can these models reliably understand and operate within the complex, often nuanced, constraints imposed by organizational hierarchies and associated permissions? Evaluating this crucial capability is inherently difficult due to the proprietary and sensitive nature of real-world corporate data and access control policies. We introduce a synthetic yet representative OrgAccess benchmark consisting of 40 distinct types of permissions commonly relevant across different organizational roles and levels. We further create three types of permissions: 40,000 easy (1 permission), 10,000 medium (3-permissions tuple), and 20,000 hard (5-permissions tuple) to test LLMs' ability to accurately assess these permissions and generate responses that strictly adhere to the specified hierarchical rules, particularly in scenarios involving users with overlapping or conflicting permissions. Our findings reveal that even state-of-the-art LLMs struggle significantly to maintain compliance with role-based structures, even with explicit instructions, and their performance degrades further when navigating interactions involving two or more conflicting permissions. Specifically, even GPT-4.1 only achieves an F1-Score of 0.27 on our hardest benchmark. This demonstrates a critical limitation in LLMs' complex rule following and compositional reasoning capabilities beyond standard factual or STEM-based benchmarks, opening up a new paradigm for evaluating their fitness for practical, structured environments.

SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning

Abstract

arXiv:2505.19099v1 Announce Type: new Abstract: We present SeePhys, a large-scale multimodal benchmark for LLM reasoning grounded in physics questions ranging from middle school to PhD qualifying exams. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. In contrast to prior works where visual elements mainly serve auxiliary purposes, our benchmark features a substantial proportion of vision-essential problems (75%) that mandate visual information extraction for correct solutions. Through extensive evaluation, we observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark. These results reveal fundamental challenges in current large language models' visual understanding capabilities, particularly in: (i) establishing rigorous coupling between diagram interpretation and physics reasoning, and (ii) overcoming their persistent reliance on textual cues as cognitive shortcuts.

摘要

我们推出SeePhys——一个基于从中学到博士资格考试物理问题的大规模多模态基准测试，用于评估大语言模型的物理推理能力。该基准涵盖物理学7个基础领域，包含21类高度异质化的图表。与先前研究中视觉元素主要起辅助作用不同，我们的基准测试中视觉关键问题占比高达75%，这类问题必须通过视觉信息提取才能获得正确答案。通过广泛评估发现，即使最先进的视觉推理模型（如Gemini-2.5-pro和o4-mini）在本基准上的准确率也不足60%。这些结果揭示了当前大语言模型在视觉理解能力上存在根本性挑战，主要体现在：(i) 难以建立图表解析与物理推理之间的严格耦合关系；(ii) 无法克服对文本线索作为认知捷径的持续依赖。

Reinforced Latent Reasoning for LLM-based Recommendation

Abstract

arXiv:2505.19092v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks, sparking growing interest in their application to preference reasoning in recommendation systems. Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data. However, these methods face significant practical limitations due to (1) the difficulty of obtaining high-quality CoT data in recommendation and (2) the high inference latency caused by generating CoT reasoning. In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning. This approach eliminates the need for explicit CoT generation and improves inference efficiency, as a small set of latent tokens can effectively capture the entire reasoning process. Building on this idea, we propose Reinforced Latent Reasoning for Recommendation (LatentR³), a novel end-to-end training framework that leverages reinforcement learning (RL) to optimize latent reasoning without relying on any CoT data. LatentR³ adopts a two-stage training strategy: first, supervised fine-tuning to initialize the latent reasoning module, followed by pure RL training to encourage exploration through a rule-based reward design. Our RL implementation is based on a modified GRPO algorithm, which reduces computational overhead during training and introduces continuous reward signals for more efficient learning. Extensive experiments demonstrate that LatentR³ enables effective latent reasoning without any direct supervision of the reasoning process, significantly improving performance when integrated with different LLM-based recommendation methods. Our codes are available at https://anonymous.4open.science/r/R3-A278/.

摘要

大型语言模型（LLMs）在复杂问题解决任务中展现出卓越的推理能力，这激发了人们对其在推荐系统中偏好推理应用的日益关注。现有方法通常依赖于显式思维链（CoT）数据的微调，但这些方法面临两大实际限制：（1）难以获取高质量的推荐领域CoT数据；（2）生成CoT推理导致的高推理延迟。本研究探索了一种替代方案，将显式CoT推理转向紧凑、信息密集的潜在推理。该方法无需生成显式CoT，并通过少量潜在令牌即可完整捕获推理过程，从而提升推理效率。基于此，我们提出《推荐系统中的强化潜在推理》（LatentR³）——一种端到端训练框架，利用强化学习（RL）优化潜在推理且不依赖任何CoT数据。LatentR³采用两阶段训练策略：首先通过监督微调初始化潜在推理模块，再通过基于规则的奖励设计进行纯RL训练以促进探索。我们的RL实现基于改进的GRPO算法，可降低训练计算开销并提供连续奖励信号以实现高效学习。大量实验表明，LatentR³能在无任何推理过程直接监督的情况下实现有效潜在推理，当与不同基于LLM的推荐方法结合时显著提升性能。代码发布于https://anonymous.4open.science/r/R3-A278/。

ScreenExplorer: Training a Vision-Language Model for Diverse Exploration in Open GUI World

Abstract

arXiv:2505.19095v1 Announce Type: new Abstract: The rapid progress of large language models (LLMs) has sparked growing interest in building Artificial General Intelligence (AGI) within Graphical User Interface (GUI) environments. However, existing GUI agents based on LLMs or vision-language models (VLMs) often fail to generalize to novel environments and rely heavily on manually curated, diverse datasets. To overcome these limitations, we introduce ScreenExplorer, a VLM trained via Group Relative Policy Optimization (GRPO) in real, dynamic, and open-ended GUI environments. We also introduce a world-model-based curiosity reward function to help the agent overcome the cold-start phase of exploration. Additionally, distilling experience streams further enhances the model's exploration capabilities. Our training framework enhances model exploration in open GUI environments, with trained models showing better environmental adaptation and sustained exploration compared to static deployment models. Our findings offer a scalable pathway toward AGI systems with self-improving capabilities in complex interactive settings.

摘要

大语言模型（LLM）的快速发展引发了在图形用户界面（GUI）环境中构建人工通用智能（AGI）的日益增长的兴趣。然而，现有基于LLM或视觉语言模型（VLM）的GUI智能体往往难以泛化至新环境，且严重依赖人工整理的多样化数据集。为克服这些局限，我们提出了ScreenExplorer——一种通过群体相对策略优化（GRPO）在真实、动态且开放式的GUI环境中训练的VLM。创新性地，我们引入了基于世界模型的好奇心奖励函数，以帮助智能体克服探索的冷启动阶段。此外，经验流的蒸馏进一步增强了模型的探索能力。我们的训练框架提升了模型在开放GUI环境中的探索性能，与静态部署模型相比，经过训练的模型展现出更好的环境适应性与持续探索能力。本研究为复杂交互场景中具有自我提升能力的AGI系统提供了一条可扩展的发展路径。

Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs

Abstract

arXiv:2505.19075v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise their generalization. While Parameter-Efficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically require retraining for each LLM backbone due to architectural dependencies. To address these challenges, here we propose Universal Reasoner (UniR) - a single, lightweight, composable, and plug-and-play reasoning module that can be used with any frozen LLM to endow it with specialized reasoning capabilities. Specifically, UniR decomposes the reward into a standalone reasoning module that is trained independently using predefined rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR can be combined with any frozen LLM at inference time by simply adding its output logits to those of the LLM backbone. This additive structure naturally enables modular composition: multiple UniR modules trained for different tasks can be jointly applied by summing their logits, enabling complex reasoning via composition. Experimental results on mathematical reasoning and machine translation tasks show that UniR significantly outperforms existing baseline fine-tuning methods using the Llama3.2 model. Furthermore, UniR demonstrates strong weak-to-strong generalization: reasoning modules trained on smaller models effectively guide much larger LLMs. This makes UniR a cost-efficient, adaptable, and robust solution for enhancing reasoning in LLMs without compromising their core capabilities. Code is open-sourced at https://github.com/hangeol/UniR

摘要

大语言模型（LLMs）已展现出卓越的通用能力，但提升推理等专项技能通常需要大量计算资源，并可能削弱其泛化性能。虽然参数高效微调（PEFT）方法提供了更节约资源的替代方案，但由于架构依赖性，这些方法通常需要针对每个LLM主干进行重新训练。为解决这些问题，本文提出通用推理器（UniR）——一个轻量级、可组合、即插即用的独立推理模块，可与任何冻结的LLM结合使用以赋予其专业推理能力。具体而言，UniR将奖励函数解耦为独立的推理模块，通过预定义奖励进行独立训练，从而将轨迹级信号有效转化为词元级指导。训练完成后，UniR只需将其输出逻辑值与LLM主干的逻辑值相加，即可在推理阶段与任意冻结的LLM结合使用。这种可加性结构天然支持模块化组合：针对不同任务训练的多个UniR模块可通过逻辑值求和实现联合应用，从而通过组合完成复杂推理。在数学推理和机器翻译任务上的实验表明，UniR显著优于使用Llama3.2模型的现有基线微调方法。此外，UniR展现出强大的弱到强泛化能力：基于较小模型训练的推理模块能有效指导规模更大的LLMs。这使得UniR成为一种在不损害核心能力的前提下，提升LLM推理能力的经济高效、适应性强且稳健的解决方案。
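
The additive combination described in the UniR abstract (module logits added to the frozen backbone's logits, with several modules composed by summation) can be shown on a toy three-token vocabulary. This is a minimal sketch rather than the released code: the logit values and module names are invented, and it assumes the backbone and all modules share one vocabulary and are summed without weights.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def compose_logits(backbone_logits, module_logits_list):
    """Add each reasoning module's logits to the frozen backbone's logits."""
    combined = list(backbone_logits)
    for module_logits in module_logits_list:
        combined = [c + m for c, m in zip(combined, module_logits)]
    return combined

# Assumed toy next-token logits over a three-token vocabulary.
backbone = [2.0, 1.0, 0.0]
math_module = [0.0, 2.0, 0.0]          # hypothetical module trained for math
translation_module = [0.0, 0.0, 0.5]   # hypothetical module trained for translation
combined = compose_logits(backbone, [math_module, translation_module])
probs = softmax(combined)
best_token = max(range(len(probs)), key=probs.__getitem__)
```

In this toy case the backbone alone would prefer token 0, while the summed logits shift the choice to token 1; the backbone's parameters are never touched, which is the point of the plug-and-play design.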

CardioCoT: Hierarchical Reasoning for Multimodal Survival Analysis

Abstract

arXiv:2505.19195v1 Announce Type: new Abstract: Accurate prediction of major adverse cardiovascular events recurrence risk in acute myocardial infarction patients based on postoperative cardiac MRI and associated clinical notes is crucial for precision treatment and personalized intervention. Existing methods primarily focus on risk stratification capability while overlooking the need for intermediate robust reasoning and model interpretability in clinical practice. Moreover, end-to-end risk prediction using LLM/VLM faces significant challenges due to data limitations and modeling complexity. To bridge this gap, we propose CardioCoT, a novel two-stage hierarchical reasoning-enhanced survival analysis framework designed to enhance both model interpretability and predictive performance. In the first stage, we employ an evidence-augmented self-refinement mechanism to guide LLM/VLMs in generating robust hierarchical reasoning trajectories based on associated radiological findings. In the second stage, we integrate the reasoning trajectories with imaging data for risk model training and prediction. CardioCoT demonstrates superior performance in MACE recurrence risk prediction while providing interpretable reasoning processes, offering valuable insights for clinical decision-making.

摘要

基于术后心脏磁共振成像及相关临床记录，准确预测急性心肌梗死患者主要不良心血管事件复发风险对于精准治疗和个性化干预至关重要。现有方法主要关注风险分层能力，而忽视了临床实践中对中间稳健推理和模型可解释性的需求。此外，由于数据限制和建模复杂性，使用LLM/VLM进行端到端风险预测面临重大挑战。为弥补这一空白，我们提出CardioCoT——一个新颖的两阶段分层推理增强生存分析框架，旨在同时提升模型可解释性和预测性能。第一阶段采用证据增强的自优化机制，引导LLM/VLM基于相关放射学发现生成稳健的分层推理轨迹；第二阶段将推理轨迹与影像数据整合进行风险模型训练与预测。CardioCoT在MACE复发风险预测中展现出卓越性能，同时提供可解释的推理过程，为临床决策提供宝贵见解。

Structuring the Unstructured: A Multi-Agent System for Extracting and Querying Financial KPIs and Guidance

Abstract

arXiv:2505.19197v1 Announce Type: new Abstract: Extracting structured and quantitative insights from unstructured financial filings is essential in investment research, yet remains time-consuming and resource-intensive. Conventional approaches in practice rely heavily on labor-intensive manual processes, limiting scalability and delaying the research workflow. In this paper, we propose an efficient and scalable method for accurately extracting quantitative insights from unstructured financial documents, leveraging a multi-agent system composed of large language models. Our proposed multi-agent system consists of two specialized agents: the Extraction Agent and the Text-to-SQL Agent. The Extraction Agent automatically identifies key performance indicators from unstructured financial text, standardizes their formats, and verifies their accuracy. On the other hand, the Text-to-SQL Agent generates executable SQL statements from natural language queries, allowing users to access structured data accurately without requiring familiarity with the database schema. Through experiments, we demonstrate that our proposed system effectively transforms unstructured text into structured data accurately and enables precise retrieval of key information. First, we demonstrate that our system achieves approximately 95% accuracy in transforming financial filings into structured data, matching the performance level typically attained by human annotators. Second, in a human evaluation of the retrieval task -- where natural language queries are used to search information from structured data -- 91% of the responses were rated as correct by human evaluators. In both evaluations, our system generalizes well across financial document types, consistently delivering reliable performance.

摘要

从非结构化财务文件中提取结构化定量洞察对投资研究至关重要，但这一过程仍耗时且资源密集。传统实践方法严重依赖劳动密集型人工处理，限制了可扩展性并延缓研究流程。本文提出一种高效可扩展的方法，通过基于大语言模型的多智能体系统，从非结构化财务文档中准确提取定量信息。我们设计的多智能体系统包含两个专用代理：提取代理和文本转SQL代理。提取代理能自动识别非结构化财务文本中的关键绩效指标，标准化其格式并验证准确性；而文本转SQL代理可将自然语言查询转换为可执行SQL语句，使用户无需了解数据库模式即可准确访问结构化数据。实验表明，本系统能有效将非结构化文本准确转化为结构化数据并实现关键信息的精准检索。首先，系统在财务文件结构化转换中达到约95%的准确率，与人工标注水平相当；其次，在基于自然语言查询的结构化数据检索任务的人为评估中，91%的响应被评估者判定为正确。两项评估均显示，本系统对不同类型财务文档具有良好的泛化能力，能持续提供可靠性能。

Investigating Pedagogical Teacher and Student LLM Agents: Genetic Adaptation Meets Retrieval Augmented Generation Across Learning Style

Abstract

arXiv:2505.19173v1 Announce Type: new Abstract: Effective teaching requires adapting instructional strategies to accommodate the diverse cognitive and behavioral profiles of students, a persistent challenge in education and teacher training. While Large Language Models (LLMs) offer promise as tools to simulate such complex pedagogical environments, current simulation frameworks are limited in two key respects: (1) they often reduce students to static knowledge profiles, and (2) they lack adaptive mechanisms for modeling teachers who evolve their strategies in response to student feedback. To address these gaps, we introduce a novel simulation framework that integrates LLM-based heterogeneous student agents with a self-optimizing teacher agent. The teacher agent's pedagogical policy is dynamically evolved using a genetic algorithm, allowing it to discover and refine effective teaching strategies based on the aggregate performance of diverse learners. In addition, we propose Persona-RAG, a Retrieval Augmented Generation module that enables student agents to retrieve knowledge tailored to their individual learning styles. Persona-RAG preserves the retrieval accuracy of standard RAG baselines while enhancing personalization, an essential factor in modeling realistic educational scenarios. Through extensive experiments, we demonstrate how our framework supports the emergence of distinct and interpretable teaching patterns when interacting with varied student populations. Our results highlight the potential of LLM-driven simulations to inform adaptive teaching practices and provide a testbed for training human educators in controlled, data-driven environments.

摘要

有效教学需要调整教学策略以适应学生多样化的认知和行为特征，这是教育及教师培训中长期存在的挑战。虽然大语言模型（LLMs）作为模拟此类复杂教学环境的工具展现出潜力，但现有仿真框架存在两个关键局限：（1）通常将学生简化为静态知识图谱；（2）缺乏建模教师根据学生反馈动态调整策略的适应机制。为弥补这些不足，我们提出了一种新型仿真框架，该框架整合了基于LLM的异构学生智能体与自优化教师智能体。教师智能体的教学策略通过遗传算法动态进化，使其能根据多样化学习者的整体表现发现并优化教学策略。此外，我们提出Persona-RAG——一个检索增强生成模块，使学生智能体能获取符合其个性化学习风格的知识。该模块在保持标准RAG基线检索准确性的同时增强了个性化程度，这是构建真实教育场景模型的关键要素。通过大量实验，我们展示了该框架如何在与不同学生群体互动时形成独特且可解释的教学模式。研究结果凸显了LLM驱动仿真在指导适应性教学实践方面的潜力，并为在受控的数据驱动环境中培训人类教师提供了实验平台。

GUARDIAN: Safeguarding LLM Multi-Agent Collaborations with Temporal Graph Modeling

Abstract

arXiv:2505.19234v1 Announce Type: new Abstract: The emergence of large language models (LLMs) enables the development of intelligent agents capable of engaging in complex and multi-turn dialogues. However, multi-agent collaboration faces critical safety challenges, such as hallucination amplification and error injection and propagation. This paper presents GUARDIAN, a unified method for detecting and mitigating multiple safety concerns in GUARDing Intelligent Agent collaboratioNs. By modeling the multi-agent collaboration process as a discrete-time temporal attributed graph, GUARDIAN explicitly captures the propagation dynamics of hallucinations and errors. Its unsupervised encoder-decoder architecture, which incorporates an incremental training paradigm, learns to reconstruct node attributes and graph structures from latent embeddings, enabling the identification of anomalous nodes and edges with unparalleled precision. Moreover, we introduce a graph abstraction mechanism based on the Information Bottleneck Theory, which compresses temporal interaction graphs while preserving essential patterns. Extensive experiments demonstrate GUARDIAN's effectiveness in safeguarding LLM multi-agent collaborations against diverse safety vulnerabilities, achieving state-of-the-art accuracy with efficient resource utilization.

摘要

大型语言模型（LLMs）的出现使得能够开发出参与复杂多轮对话的智能体。然而，多智能体协作面临关键的安全挑战，如幻觉放大以及错误注入与传播。本文提出GUARDIAN，一种用于检测和缓解智能体协作中多种安全问题的统一方法，通过将多智能体协作过程建模为离散时间时序属性图，GUARDIAN显式捕获幻觉和错误的传播动态。采用无监督编码器-解码器架构并结合增量训练范式，该方法学习从潜在嵌入中重构节点属性和图结构，从而以极高精度识别异常节点和边。此外，我们引入基于信息瓶颈理论的图抽象机制，在压缩时序交互图的同时保留关键模式。大量实验证明，GUARDIAN在保护LLM多智能体协作抵御各类安全漏洞方面具有显著效果，以高效资源利用率实现了最先进的准确率。

Sensorimotor features of self-awareness in multimodal large language models

Abstract

arXiv:2505.19237v1 Announce Type: new Abstract: Self-awareness - the ability to distinguish oneself from the surrounding environment - underpins intelligent, autonomous behavior. Recent advances in AI achieve human-like performance in tasks integrating multimodal information, particularly in large language models, raising interest in the embodiment capabilities of AI agents on nonhuman platforms such as robots. Here, we explore whether multimodal LLMs can develop self-awareness solely through sensorimotor experiences. By integrating a multimodal LLM into an autonomous mobile robot, we test its ability to achieve this capacity. We find that the system exhibits robust environmental awareness, self-recognition and predictive awareness, allowing it to infer its robotic nature and motion characteristics. Structural equation modeling reveals how sensory integration influences distinct dimensions of self-awareness and its coordination with past-present memory, as well as the hierarchical internal associations that drive self-identification. Ablation tests of sensory inputs identify critical modalities for each dimension, demonstrate compensatory interactions among sensors and confirm the essential role of structured and episodic memory in coherent reasoning. These findings demonstrate that, given appropriate sensory information about the world and itself, multimodal LLMs exhibit emergent self-awareness, opening the door to artificial embodied cognitive systems.

摘要

自我意识——即区分自身与周围环境的能力——是智能自主行为的基石。近期人工智能在多模态信息整合任务中（尤其是大语言模型）实现了类人性能，引发了人们对机器人等非人类平台上AI智能体具身能力的兴趣。本研究探讨多模态大语言模型能否仅通过感觉运动经验发展自我意识。通过将多模态大语言模型集成至自主移动机器人，我们测试了其实现该能力的可能性。研究发现该系统展现出强大的环境感知、自我识别和预测性意识，使其能够推断自身的机器人属性和运动特征。结构方程模型揭示了感觉整合如何影响自我意识的不同维度及其与过去-现在记忆的协调，以及驱动自我识别的层级化内部关联。感官输入的消融实验确定了各维度的关键模态，证明了传感器间的补偿性相互作用，并验证了结构化记忆与情景记忆在连贯推理中的核心作用。这些发现表明，当获得关于世界和自身的适当感官信息时，多模态大语言模型会呈现出涌现的自我意识，为人工具身认知系统的发展开启了新途径。

ODIN: A NL2SQL Recommender to Handle Schema Ambiguity

Abstract

arXiv:2505.19302v1 Announce Type: new Abstract: NL2SQL (natural language to SQL) systems translate natural language into SQL queries, allowing users with no technical background to interact with databases and create tools like reports or visualizations. While recent advancements in large language models (LLMs) have significantly improved NL2SQL accuracy, schema ambiguity remains a major challenge in enterprise environments with complex schemas, where multiple tables and columns with semantically similar names often co-exist. To address schema ambiguity, we introduce ODIN, an NL2SQL recommendation engine. Instead of producing a single SQL query given a natural language question, ODIN generates a set of potential SQL queries by accounting for different interpretations of ambiguous schema components. ODIN dynamically adjusts the number of suggestions based on the level of ambiguity, and ODIN learns from user feedback to personalize future SQL query recommendations. Our evaluation shows that ODIN improves the likelihood of generating the correct SQL query by 1.5-2x compared to baselines.

摘要

NL2SQL(自然语言转SQL)系统将自然语言转换为SQL查询，使非技术背景用户能够与数据库交互并创建报表或可视化等工具。尽管大语言模型(LLM)的最新进展显著提升了NL2SQL的准确率，但在具有复杂模式的企业环境中，模式歧义仍是主要挑战——这些环境中常存在多个表及语义相似的列名共存的情况。为解决模式歧义问题，我们提出了ODIN，一个NL2SQL推荐引擎。ODIN不针对自然语言问题生成单一SQL查询，而是通过考虑歧义模式组件的不同解释，生成一组潜在SQL查询。ODIN根据歧义程度动态调整建议数量，并通过学习用户反馈来个性化未来的SQL查询推荐。评估表明，与基线相比，ODIN将生成正确SQL查询的概率提高了1.5-2倍。
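
The idea of returning a set of candidate queries instead of a single one can be illustrated with a minimal Python sketch. This is not ODIN's algorithm: the ambiguity map, the table and column names, and the `max_suggestions` cap are all hypothetical. Each ambiguous slot maps to several schema candidates, and the sketch simply enumerates the combinations, so more ambiguity yields more suggestions up to the cap.

```python
from itertools import product

def candidate_queries(template, ambiguous, max_suggestions=5):
    """Fill a SQL template with every combination of candidate schema elements."""
    slots = sorted(ambiguous)                    # fixed slot order for reproducibility
    options = [ambiguous[s] for s in slots]
    queries = []
    for combo in product(*options):
        queries.append(template.format(**dict(zip(slots, combo))))
    return queries[:max_suggestions]             # hypothetical cap on suggestions

# Hypothetical schema: two plausible tables and two plausible columns.
ambiguous = {
    "table": ["sales_2024", "sales_archive"],
    "amount": ["revenue", "net_revenue"],
}
suggestions = candidate_queries("SELECT SUM({amount}) FROM {table}", ambiguous)
for q in suggestions:
    print(q)
```

A real recommender would rank the candidates and update that ranking from user feedback, which this sketch leaves out.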

Evaluating Steering Techniques using Human Similarity Judgments

Abstract

arXiv:2505.19333v1 Announce Type: new Abstract: Current evaluations of Large Language Model (LLM) steering techniques focus on task-specific performance, overlooking how well steered representations align with human cognition. Using a well-established triadic similarity judgment task, we assessed steered LLMs on their ability to flexibly judge similarity between concepts based on size or kind. We found that prompt-based steering methods outperformed other methods both in terms of steering accuracy and model-to-human alignment. We also found LLMs were biased towards 'kind' similarity and struggled with 'size' alignment. This evaluation approach, grounded in human cognition, adds further support to the efficacy of prompt-based steering and reveals privileged representational axes in LLMs prior to steering.

摘要

当前对大型语言模型（LLM）引导技术的评估主要关注任务特定性能，而忽视了被引导的表征与人类认知的契合程度。本研究通过成熟的三元相似性判断任务，评估了受引导LLM在基于'尺寸'或'类别'灵活判断概念相似性的能力。研究发现，基于提示的引导方法在引导准确性和模型-人类对齐度方面均优于其他方法。同时发现LLM存在偏向'类别'相似性的偏见，且在'尺寸'维度上难以实现对齐。这种基于人类认知的评估方法，不仅进一步验证了基于提示的引导技术有效性，还揭示了LLM在未经引导前就存在的表征轴偏好。

Using Large Language Models to Assess Teachers' Pedagogical Content Knowledge

Abstract

arXiv:2505.19266v1 Announce Type: new Abstract: Assessing teachers' pedagogical content knowledge (PCK) through performance-based tasks is both time- and effort-consuming. While large language models (LLMs) offer new opportunities for efficient automatic scoring, little is known about whether LLMs introduce construct-irrelevant variance (CIV) in ways similar to or different from traditional machine learning (ML) and human raters. This study examines three sources of CIV -- scenario variability, rater severity, and rater sensitivity to scenario -- in the context of video-based constructed-response tasks targeting two PCK sub-constructs: analyzing student thinking and evaluating teacher responsiveness. Using generalized linear mixed models (GLMMs), we compared variance components and rater-level scoring patterns across three scoring sources: human raters, supervised ML, and LLM. Results indicate that scenario-level variance was minimal across tasks, while rater-related factors contributed substantially to CIV, especially in the more interpretive Task II. The ML model was the most severe and least sensitive rater, whereas the LLM was the most lenient. These findings suggest that the LLM contributes to scoring efficiency while also introducing CIV as human raters do, yet with varying levels of contribution compared to supervised ML. Implications for rater training, automated scoring design, and future research on model interpretability are discussed.

摘要

评估教师的学科教学知识（PCK）通过基于表现的任务既耗时又费力。尽管大语言模型（LLM）为高效自动评分提供了新机遇，但目前尚不清楚LLM是否会在与传统机器学习（ML）和人类评分者相似或不同的方式下引入构念无关变异（CIV）。本研究在基于视频的建构反应任务背景下，考察了三种CIV来源——情境变异性、评分者严厉度以及评分者对情境的敏感性——这些任务针对PCK的两个子构念：分析学生思维和评估教师回应能力。通过广义线性混合模型（GLMM），我们比较了三种评分来源（人类评分者、监督式ML和LLM）的方差成分和评分者层面的评分模式。结果显示，跨任务的情境水平方差极小，而评分者相关因素对CIV贡献显著，尤其在更具解释性的任务II中。ML模型是最严厉且敏感性最低的评分者，而LLM则最为宽松。这些发现表明，LLM在提升评分效率的同时，也像人类评分者一样引入了CIV，但其贡献程度与监督式ML有所不同。研究还讨论了对评分者培训、自动评分设计以及未来模型可解释性研究的启示。

Abstract

arXiv:2505.19442v1 Announce Type: new Abstract: Controllable code generation, the ability to synthesize code that follows a specified style while maintaining functionality, remains a challenging task. We propose a two-stage training framework combining contrastive learning and conditional decoding to enable flexible style control. The first stage aligns code style representations with semantic and structural features. In the second stage, we fine-tune a language model (e.g., Flan-T5) conditioned on the learned style vector to guide generation. Our method supports style interpolation and user personalization via lightweight mixing. Compared to prior work, our unified framework offers improved stylistic control without sacrificing code correctness. This is among the first approaches to combine contrastive alignment with conditional decoding for style-guided code generation.

摘要

可控代码生成是指在保持功能性的同时合成符合特定风格代码的能力，这仍是一项具有挑战性的任务。我们提出了一种结合对比学习和条件解码的两阶段训练框架，以实现灵活的风格控制。第一阶段将代码风格表示与语义和结构特征对齐。第二阶段，我们基于学习到的风格向量对语言模型（如Flan-T5）进行微调以指导生成。我们的方法通过轻量级混合支持风格插值和用户个性化定制。与现有工作相比，该统一框架在不牺牲代码正确性的前提下提供了更好的风格控制能力。这是首个将对比对齐与条件解码相结合来实现风格引导代码生成的方法之一。

Architectures of Error: A Philosophical Inquiry into AI and Human Code Generation

Abstract

arXiv:2505.19353v1 Announce Type: new Abstract: With the rise of generative AI (GenAI), Large Language Models are increasingly employed for code generation, becoming active co-authors alongside human programmers. Focusing specifically on this application domain, this paper articulates distinct "Architectures of Error" to ground an epistemic distinction between human and machine code generation. Examined through their shared vulnerability to error, this distinction reveals fundamentally different causal origins: human-cognitive versus artificial-stochastic. To develop this framework and substantiate the distinction, the analysis draws critically upon Dennett's mechanistic functionalism and Rescher's methodological pragmatism. I argue that a systematic differentiation of these error profiles raises critical philosophical questions concerning semantic coherence, security robustness, epistemic limits, and control mechanisms in human-AI collaborative software development. The paper also utilizes Floridi's levels of abstraction to provide a nuanced understanding of how these error dimensions interact and may evolve with technological advancements. This analysis aims to offer philosophers a structured framework for understanding GenAI's unique epistemological challenges, shaped by these architectural foundations, while also providing software engineers a basis for more critically informed engagement.

摘要

随着生成式人工智能（GenAI）的兴起，大语言模型日益被用于代码生成，成为与人类程序员并肩的活跃合著者。本文聚焦这一特定应用领域，提出独特的"错误架构"理论框架，以确立人类与机器在代码生成层面的认知差异。通过分析二者共有的错误脆弱性，这种差异揭示了根本不同的因果起源：人类认知型错误与人工随机型错误。为构建该框架并验证其区分效度，本研究批判性借鉴了丹尼特的机械功能主义与雷谢尔的方法论实用主义。笔者认为，系统区分这两类错误模式将引发关于人机协作软件开发中语义连贯性、安全鲁棒性、认知边界及控制机制等关键哲学问题。本文还运用弗洛里迪的抽象层级理论，对这些错误维度的交互作用及其可能随技术发展的演变路径进行了精细化阐释。该分析旨在为哲学家提供理解生成式人工智能独特认识论挑战的结构化框架（这些挑战由上述架构基础所塑造），同时为软件工程师开展更具批判性的实践提供理论基础。

CaseEdit: Enhancing Localized Commonsense Reasoning via Null-Space Constrained Knowledge Editing in Small Parameter Language Models

Abstract

arXiv:2505.19383v1 Announce Type: new Abstract: Large language models (LLMs) exhibit strong performance on factual recall and general reasoning but struggle to adapt to user-specific, commonsense knowledge, a challenge particularly acute in small-parameter settings where computational efficiency is prioritized. We introduce CaseEdit, a new dataset and generation pipeline for evaluating localized, personalized commonsense knowledge editing in small LLMs to address this. Built upon the ATOMIC20/20 commonsense graph, CaseEdit uses a multi-stage inference process to generate both typical and atypical contextual edits for household objects, paired with targeted evaluation questions across four axes: reliability, generalization, locality, and portability. We evaluate established knowledge editing methods using CaseEdit and demonstrate that AlphaEdit, a technique employing null-space projection to minimize interference with unrelated knowledge, consistently outperforms other methods when applied to an LLaMA 3.2 3B model, even in scalability tests, showing minimal ripple effects. Our results indicate that using CaseEdit with effective editing techniques like AlphaEdit allows small models to internalize high-quality, context-sensitive common-sense knowledge, paving the way for lightweight, personalized assistants.

摘要

大语言模型（LLMs）在事实回忆和通用推理方面表现优异，但难以适应用户特定的常识知识，这一挑战在优先考虑计算效率的小参数量场景中尤为突出。为此，我们提出CaseEdit——一个用于评估小型LLMs中局部化、个性化常识知识编辑的新数据集与生成流程。该工作基于ATOMIC20/20常识图谱，通过多阶段推理过程生成家用物品的典型与非典型上下文编辑内容，并配套针对可靠性、泛化性、局部性和可迁移性四个维度的评估问题。我们使用CaseEdit评估现有知识编辑方法，结果表明：采用零空间投影技术以最小化无关知识干扰的AlphaEdit方法，在LLaMA 3.2 3B模型上持续优于其他方法，即使在可扩展性测试中也仅产生微小涟漪效应。研究证实，通过CaseEdit与AlphaEdit等高效编辑技术结合，可使小模型内化高质量、上下文敏感的常识知识，为轻量级个性化助手的发展铺平道路。
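
The null-space idea mentioned for AlphaEdit can be sketched in plain Python. This is an illustration of null-space projection in general, not the AlphaEdit implementation: the helper names, the toy key vectors, and the toy update matrix are hypothetical. The sketch removes from a weight update every component that lies along "preserved" key vectors, so that applying the projected update to those keys yields zero change.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:                      # skip linearly dependent vectors
            basis.append([wi / norm for wi in w])
    return basis

def project_to_null_space(delta, preserved_keys):
    """Remove from each row of `delta` its component along the preserved keys."""
    basis = orthonormal_basis(preserved_keys)
    projected = []
    for row in delta:
        r = list(row)
        for q in basis:
            c = dot(r, q)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        projected.append(r)
    return projected

# Toy data (hypothetical): two preserved keys in a 3-dimensional key space.
preserved_keys = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
delta = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
safe_delta = project_to_null_space(delta, preserved_keys)
# Effect of the projected update on each preserved key (all entries are zero).
residual = [[dot(row, k) for k in preserved_keys] for row in safe_delta]
print(safe_delta, residual)
```

Only the direction orthogonal to the preserved keys survives, which is the sense in which such an edit leaves unrelated knowledge untouched.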

Recalibrating the Compass: Integrating Large Language Models into Classical Research Methods

Abstract

arXiv:2505.19402v1 Announce Type: new Abstract: This paper examines how large language models (LLMs) are transforming core quantitative methods in communication research in particular, and in the social sciences more broadly, namely content analysis, survey research, and experimental studies. Rather than replacing classical approaches, LLMs introduce new possibilities for coding and interpreting text, simulating dynamic respondents, and generating personalized and interactive stimuli. Drawing on recent interdisciplinary work, the paper highlights both the potential and limitations of LLMs as research tools, including issues of validity, bias, and interpretability. To situate these developments theoretically, the paper revisits Lasswell's foundational framework -- "Who says what, in which channel, to whom, with what effect?" -- and demonstrates how LLMs reconfigure message studies, audience analysis, and effects research by enabling interpretive variation, audience trajectory modeling, and counterfactual experimentation. Revisiting the metaphor of the methodological compass, the paper argues that classical research logics remain essential as the field integrates LLMs and generative AI. By treating LLMs not only as technical instruments but also as epistemic and cultural tools, the paper calls for thoughtful, rigorous, and imaginative use of LLMs in future communication and social science research.

摘要

本文探讨了大型语言模型(LLM)如何变革传播学研究乃至更广泛社会科学领域的核心定量方法，特别是内容分析、调查研究和实验研究。LLM并非取代传统方法，而是为文本编码与解释、动态受访者模拟以及个性化交互式刺激生成提供了新的可能性。基于近期跨学科研究成果，本文既强调了LLM作为研究工具的潜力，也指出了其在效度、偏差和可解释性等方面的局限。为从理论层面定位这些发展，本文重新审视了拉斯韦尔的基础框架——'谁通过什么渠道向谁说了什么并产生什么效果？'，并论证LLM如何通过实现解释变异、受众轨迹建模和反事实实验，重构了信息研究、受众分析和效果研究。通过重温方法论罗盘的隐喻，本文指出在整合LLM和生成式AI的过程中，经典研究逻辑仍然不可或缺。通过将LLM不仅视为技术工具，更作为认知与文化工具，本文呼吁在未来传播学与社会科学研究中以深思熟虑、严谨且富有想象力的方式运用LLM。

Origin Tracer: A Method for Detecting LoRA Fine-Tuning Origins in LLMs

Abstract

arXiv:2505.19466v1 Announce Type: new Abstract: As large language models (LLMs) continue to advance, their deployment often involves fine-tuning to enhance performance on specific downstream tasks. However, this customization is sometimes accompanied by misleading claims about the origins, raising significant concerns about transparency and trust within the open-source community. Existing model verification techniques typically assess functional, representational, and weight similarities. However, these approaches often struggle against obfuscation techniques, such as permutations and scaling transformations. To address this limitation, we propose a novel detection method, Origin-Tracer, that rigorously determines whether a model has been fine-tuned from a specified base model. This method includes the ability to extract the LoRA rank utilized during the fine-tuning process, providing a more robust verification framework. This framework is the first to provide a formalized approach specifically aimed at pinpointing the sources of model fine-tuning. We empirically validated our method on thirty-one diverse open-source models under conditions that simulate real-world obfuscation scenarios. We empirically analyze the effectiveness of our framework and, finally, discuss its limitations. The results demonstrate the effectiveness of our approach and indicate its potential to establish new benchmarks for model verification.

摘要

随着大语言模型（LLM）的持续发展，其部署通常涉及针对特定下游任务的微调以提升性能。然而这种定制化过程时常伴随关于模型来源的误导性声明，引发了开源社区对透明度和信任的严重关切。现有模型验证技术主要评估功能、表征和权重层面的相似性，但这些方法往往难以应对置换和尺度变换等混淆技术。为突破这一局限，我们提出了一种新型检测方法Origin-Tracer，该方法能严格判定模型是否基于指定基础模型进行过微调，包括提取微调过程中使用的LoRA秩，从而构建更鲁棒的验证框架。该框架首次提供了专门用于追溯模型微调来源的形式化方法。我们在模拟真实混淆场景的条件下，对三十一个多样化开源模型进行了实证验证，分析了框架的有效性并探讨了其局限性。实验结果证明了本方法的有效性，并显示出其有望为模型验证建立新基准的潜力。
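
Why a LoRA rank is recoverable in principle follows from the standard LoRA parameterization. It is given here as background, not as the Origin-Tracer procedure; the symbols $W_0$, $B$, $A$, $r$, $d$ and $k$ are the usual LoRA notation and do not appear in the abstract.

```latex
W = W_0 + BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k}
\quad\Longrightarrow\quad
\operatorname{rank}(W - W_0) = \operatorname{rank}(BA) \le r .
```

If the released weights are additionally permuted or rescaled, for example $W' = PW$ with a permutation matrix $P \neq I$, then $W' - W_0$ is generally no longer low-rank. That is the obfuscated setting in which, according to the abstract, similarity-based checks struggle.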

Genome-Bench: A Scientific Reasoning Benchmark from Real-World Expert Discussions

Abstract

arXiv:2505.19501v1 Announce Type: new Abstract: In this short report, we present an automated pipeline tailored for the genomics domain and introduce Genome-Bench, a new benchmark constructed from over a decade of scientific forum discussions on genome engineering. Our pipeline transforms raw interactions into a reinforcement-learning-friendly multiple-choice question format, supported by 3000+ high-quality question-answer pairs spanning foundational biology, experimental troubleshooting, tool usage, and beyond. To our knowledge, this is the first end-to-end pipeline for teaching LLMs to reason from scientific discussions, with promising potential for generalization across scientific domains beyond biology.

摘要

在这份简短报告中，我们提出了一个专为基因组学领域设计的自动化流程，并介绍了Genome-Bench——一个基于十余年基因组工程科学论坛讨论构建的新型基准测试。该流程将原始互动数据转化为适合强化学习的多选题形式，包含3000余个涵盖基础生物学、实验故障排除、工具使用等方面的高质量问答对。据我们所知，这是首个教导大语言模型从科学讨论中推理的端到端流程，在生物学之外的其他科学领域也具有广阔的推广潜力。

Causal-LLaVA: Causal Disentanglement for Mitigating Hallucination in Multimodal Large Language Models

Abstract

arXiv:2505.19474v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have demonstrated strong performance in visual understanding tasks, yet they often suffer from object hallucinations--generating descriptions of objects that are inconsistent with or entirely absent from the input. This issue is closely related to dataset biases, where frequent co-occurrences of objects lead to entangled semantic representations across modalities. As a result, models may erroneously activate object representations that are commonly associated with the input but not actually present. To address this, we propose a causality-driven disentanglement framework that mitigates hallucinations through causal intervention. Our approach includes a Causal-Driven Projector in the visual pathway and a Causal Intervention Module integrated into the final transformer layer of the language model. These components work together to reduce spurious correlations caused by biased training data. Experimental results show that our method significantly reduces hallucinations while maintaining strong performance on multiple multimodal benchmarks. Visualization analyses further confirm improved separability of object representations. The code is available at: https://github.com/IgniSavium/Causal-LLaVA

摘要

多模态大语言模型（MLLMs）在视觉理解任务中展现出强大性能，但普遍存在物体幻觉问题——生成与输入内容不符或完全不存在的物体描述。该问题与数据集偏差密切相关，即物体频繁共现导致跨模态语义表征纠缠，使得模型可能错误激活与输入常见关联但实际未出现的物体表征。

为此，我们提出一种因果驱动的解耦框架，通过因果干预缓解幻觉现象。该方法在视觉通路中引入因果驱动投影器，并在语言模型最终Transformer层集成因果干预模块，协同降低有偏训练数据导致的虚假相关性。

实验结果表明，本方法在保持多模态基准性能的同时显著减少幻觉现象。可视化分析进一步证实物体表征可分离性得到提升。代码发布于：https://github.com/IgniSavium/Causal-LLaVA

Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model

Abstract

arXiv:2505.19406v1 Announce Type: new Abstract: While large language models (LLMs) demonstrate strong reasoning capabilities utilizing reinforcement learning (RL) with verifiable reward, whether large vision-language models (VLMs) can directly inherit such capabilities through similar post-training strategies remains underexplored. In this work, we conduct a systematic compositional probing study to evaluate whether current VLMs trained with RL or other post-training strategies can compose capabilities across modalities or tasks under out-of-distribution conditions. We design a suite of diagnostic tasks that train models on unimodal tasks or isolated reasoning skills, and evaluate them on multimodal, compositional variants requiring skill integration. Through comparisons between supervised fine-tuning (SFT) and RL-trained models, we identify three key findings: (1) RL-trained models consistently outperform SFT on compositional generalization, demonstrating better integration of learned skills; (2) although VLMs achieve strong performance on individual tasks, they struggle to generalize compositionally under cross-modal and cross-task scenarios, revealing a significant gap in current training strategies; (3) requiring models to explicitly describe visual content before reasoning (e.g., caption-before-thinking), along with rewarding progressive vision-to-text grounding, yields notable gains. This highlights two essential ingredients for improving compositionality in VLMs: visual-to-text alignment and accurate visual grounding. Our findings shed light on the current limitations of RL-based reasoning VLM training and provide actionable insights toward building models that reason compositionally across modalities and tasks.

摘要

尽管大语言模型（LLMs）通过可验证奖励的强化学习（RL）展现出强大的推理能力，但大视觉语言模型（VLMs）能否通过类似的后训练策略直接继承这种能力仍待探索。本研究通过系统性组合探针实验，评估当前采用RL或其他后训练策略的VLMs在分布外条件下能否跨模态或跨任务组合能力。我们设计了一套诊断任务，使模型在单模态任务或孤立推理技能上训练，并在需要技能整合的多模态组合变体上测试。通过对比监督微调（SFT）与RL训练模型，发现三个关键结论：（1）RL训练模型在组合泛化上持续优于SFT，表现出更好的技能整合能力；（2）尽管VLMs在单项任务上表现优异，但在跨模态和跨任务的组合泛化中存在显著困难，揭示了当前训练策略的不足；（3）强制模型在推理前显式描述视觉内容（如'描述-再思考'策略）并奖励渐进式视觉-文本 grounding 能带来显著提升。这凸显了提升VLM组合性的两个关键要素：视觉-文本对齐与精确的视觉 grounding。我们的发现揭示了当前基于RL的VLM推理训练的局限性，并为构建跨模态和跨任务组合推理模型提供了可行方向。

Task Memory Engine: Spatial Memory for Robust Multi-Step LLM Agents

Abstract

arXiv:2505.19436v1 Announce Type: new Abstract: Large Language Models (LLMs) falter in multi-step interactions -- often hallucinating, repeating actions, or misinterpreting user corrections -- due to reliance on linear, unstructured context. This fragility stems from the lack of persistent memory to track evolving goals and task dependencies, undermining trust in autonomous agents. We introduce the Task Memory Engine (TME), a modular memory controller that transforms existing LLMs into robust, revision-aware agents without fine-tuning. TME implements a spatial memory framework that replaces flat context with graph-based structures to support consistent, multi-turn reasoning. Departing from linear concatenation and ReAct-style prompting, TME builds a dynamic task graph -- either a tree or directed acyclic graph (DAG) -- to map user inputs to subtasks, align them with prior context, and enable dependency-tracked revisions. Its Task Representation and Intent Management (TRIM) component models task semantics and user intent to ensure accurate interpretation. Across four multi-turn scenarios -- trip planning, cooking, meeting scheduling, and shopping cart editing -- TME eliminates 100% of hallucinations and misinterpretations in three tasks, and reduces hallucinations by 66.7% and misinterpretations by 83.3% across 27 user turns, outperforming ReAct. TME's modular design supports plug-and-play deployment and domain-specific customization, adaptable to both personal assistants and enterprise automation. We release TME's codebase, benchmarks, and components as open-source resources, enabling researchers to develop reliable LLM agents. TME's scalable architecture addresses a critical gap in agent performance across complex, interactive settings.

摘要

大型语言模型（LLMs）在多步交互中存在明显缺陷——常出现幻觉、重复操作或误解用户修正——这源于其对线性非结构化上下文的依赖。这种脆弱性是由于缺乏持续记忆来追踪动态目标和任务依赖关系，从而削弱了自主代理的可信度。我们提出任务记忆引擎（TME），一种模块化记忆控制器，无需微调即可将现有LLMs转化为具备修订感知能力的鲁棒代理。TME采用空间记忆框架，用基于图的结构取代扁平化上下文，以支持连贯的多轮推理。不同于线性拼接和ReAct式提示，TME构建动态任务图（树状或有向无环图）来将用户输入映射至子任务，使其与先验上下文对齐，并实现依赖追踪的修订。其任务表征与意图管理（TRIM）组件通过建模任务语义和用户意图确保准确解析。在旅行规划、烹饪、会议安排和购物车编辑四个多轮场景中，TME在三个任务中完全消除幻觉和误解现象，并在27轮用户交互中整体减少66.7%的幻觉和83.3%的误判，性能超越ReAct。TME的模块化设计支持即插即用部署和领域定制，可适配个人助手与企业自动化场景。我们开源TME的代码库、基准测试及组件，助力研究者开发可靠LLM代理。该可扩展架构填补了复杂交互场景下代理性能的关键空白。
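
A dependency-tracked revision over a task graph, as described above, might look like the following minimal Python sketch. It is not TME's code: the `TaskGraph` class, its method names, and the trip-planning subtasks are hypothetical. Subtasks form a DAG, and revising one subtask flags everything downstream of it as stale while leaving unrelated subtasks alone.

```python
from collections import defaultdict, deque

class TaskGraph:
    """A small DAG of subtasks; revising one marks its dependents as stale."""

    def __init__(self):
        self.value = {}                      # task name -> current content
        self.children = defaultdict(list)    # task name -> tasks that depend on it
        self.stale = set()                   # tasks that need to be redone

    def add(self, name, content, depends_on=()):
        self.value[name] = content
        for parent in depends_on:
            self.children[parent].append(name)

    def revise(self, name, content):
        """Update a task in place and flag every downstream task for rework."""
        self.value[name] = content
        queue = deque(self.children[name])
        while queue:                         # breadth-first walk over dependents
            task = queue.popleft()
            if task not in self.stale:
                self.stale.add(task)
                queue.extend(self.children[task])

# Hypothetical trip-planning session: the user changes the destination.
trip = TaskGraph()
trip.add("destination", "Paris")
trip.add("flight", "to Paris", depends_on=["destination"])
trip.add("hotel", "in Paris", depends_on=["destination"])
trip.add("airport transfer", "hotel pickup", depends_on=["flight", "hotel"])
trip.add("budget", "2000 EUR")
trip.revise("destination", "Rome")
print(sorted(trip.stale))
```

With a flat, concatenated context the correction "make it Rome" is just one more message; with the graph, the affected subtasks are identified explicitly and the budget node is untouched.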

Judging with Many Minds: Do More Perspectives Mean Less Prejudice?

Abstract

arXiv:2505.19477v1 Announce Type: new Abstract: LLM-as-Judge has emerged as a scalable alternative to human evaluation, enabling large language models (LLMs) to provide reward signals during training. While recent work has explored multi-agent extensions such as multi-agent debate and meta-judging to enhance evaluation quality, the question of how intrinsic biases manifest in these settings remains underexplored. In this study, we conduct a systematic analysis of four diverse bias types: position bias, verbosity bias, chain-of-thought bias, and bandwagon bias. We evaluate these biases across two widely adopted multi-agent LLM-as-Judge frameworks: Multi-Agent-Debate and LLM-as-Meta-Judge. Our results show that the debate framework amplifies biases sharply after the initial debate, and this increased bias is sustained in subsequent rounds, while meta-judge approaches exhibit greater resistance. We further investigate the incorporation of PINE, a leading single-agent debiasing method, as a bias-free agent within these systems. The results reveal that this bias-free agent effectively reduces biases in debate settings but provides less benefit in meta-judge scenarios. Our work provides a comprehensive study of bias behavior in multi-agent LLM-as-Judge systems and highlights the need for targeted bias mitigation strategies in collaborative evaluation settings.

摘要

LLM-as-Judge（大语言模型作为评判者）已成为人类评估的可扩展替代方案，使大语言模型（LLMs）能够在训练中提供奖励信号。尽管近期研究探索了多智能体扩展（如多智能体辩论和元评判）以提升评估质量，但这些场景中内在偏见如何显现的问题仍未得到充分研究。在本研究中，我们对四种不同类型的偏见进行了系统分析：位置偏见、冗长偏见、思维链偏见和从众偏见。我们在两种广泛采用的多智能体LLM-as-Judge框架（多智能体辩论和LLM-as-元评判）中评估了这些偏见。结果表明，辩论框架在初始辩论后偏见急剧放大，且这种增加的偏见在后续轮次中持续存在，而元评判方法表现出更强的抵抗性。我们进一步研究了将领先的单智能体去偏方法PINE作为无偏见智能体引入这些系统的效果。结果显示，该无偏见智能体能有效减少辩论设置中的偏见，但在元评判场景中益处有限。本研究全面探讨了多智能体LLM-as-Judge系统中的偏见行为，并强调了在协作评估场景中需要针对性偏见缓解策略的重要性。
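
One common way to quantify position bias is to present each answer pair in both orders and count how often the verdict changes. The Python sketch below assumes that protocol; it is not the paper's evaluation code, and the two toy judges and the answer pairs are hypothetical stand-ins for LLM calls.

```python
def position_flip_rate(judge, pairs):
    """Fraction of pairs whose verdict changes when the answer order is swapped."""
    flips = 0
    for a, b in pairs:
        first = judge(a, b)      # judge returns the text of the preferred answer
        second = judge(b, a)     # same pair, opposite presentation order
        if first != second:
            flips += 1
    return flips / len(pairs)

def always_first(a, b):
    """Toy judge with maximal position bias: prefers whatever is shown first."""
    return a

def longer_wins(a, b):
    """Toy judge that ignores order (it shows a verbosity bias instead)."""
    return a if len(a) >= len(b) else b

pairs = [
    ("short", "a much longer answer"),
    ("yes", "no, because the premise is false"),
]
print(position_flip_rate(always_first, pairs))
print(position_flip_rate(longer_wins, pairs))
```

The second toy judge scores zero on this metric while still being biased, which is why the study tracks several bias types separately.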

Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs

Abstract

arXiv:2505.19489v1 Announce Type: new Abstract: The Linux kernel is a critical system, serving as the foundation for numerous systems. Bugs in the Linux kernel can cause serious consequences, affecting billions of users. Fault localization (FL), which aims at identifying the buggy code elements in software, plays an essential role in software quality assurance. While recent LLM agents have achieved promising accuracy in FL on recent benchmarks like SWE-bench, it remains unclear how well these methods perform in the Linux kernel, where FL is much more challenging due to the large-scale code base, limited observability, and diverse impact factors. In this paper, we introduce LinuxFLBench, an FL benchmark constructed from real-world Linux kernel bugs. We conduct an empirical study to assess the performance of state-of-the-art LLM agents on the Linux kernel. Our initial results reveal that existing agents struggle with this task, achieving a best top-1 accuracy of only 41.6% at file level. To address this challenge, we propose LinuxFL$^+$, an enhancement framework designed to improve FL effectiveness of LLM agents for the Linux kernel. LinuxFL$^+$ substantially improves the FL accuracy of all studied agents (e.g., 7.2% - 11.2% accuracy increase) with minimal costs. Data and code are available at https://github.com/FudanSELab/LinuxFLBench.

摘要

Linux内核作为支撑众多系统的关键基础设施，其缺陷可能导致影响数十亿用户的严重后果。故障定位（FL）技术通过识别软件中的缺陷代码元素，在质量保障中发挥着核心作用。尽管当前大语言模型智能体在SWE-bench等基准测试中展现出良好的FL准确率，但其在Linux内核中的表现尚不明确——由于代码规模庞大、可观测性受限及影响因素复杂，内核FL任务更具挑战性。本文提出LinuxFLBench，一个基于真实内核缺陷构建的FL基准测试集，并通过实证研究评估前沿大语言模型智能体在内核环境中的表现。实验结果表明，现有智能体在此任务中表现欠佳，文件级定位的最高top-1准确率仅为41.6%。为此，我们设计LinuxFL $^+$ 增强框架以提升LLM智能体在内核FL中的效能。该框架以极小成本显著提高了所有测试智能体的定位准确率（如7.2%-11.2%的提升幅度）。相关数据与代码已开源：https://github.com/FudanSELab/LinuxFLBench。
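
The file-level top-1 accuracy quoted above is a top-k metric: a bug counts as localized if its buggy file appears among the first k files the agent ranks. The Python sketch below assumes one buggy file per bug for simplicity. The bug IDs, file paths, and rankings are made up for illustration and are not taken from LinuxFLBench.

```python
def top_k_accuracy(rankings, buggy_files, k=1):
    """Share of bugs whose true buggy file appears in the top-k ranked files."""
    hits = 0
    for bug_id, ranked in rankings.items():
        if buggy_files[bug_id] in ranked[:k]:
            hits += 1
    return hits / len(rankings)

# Hypothetical agent output: ranked candidate files for four bugs.
rankings = {
    "bug-1": ["fs/a.c", "fs/b.c"],
    "bug-2": ["net/x.c", "net/y.c"],
    "bug-3": ["mm/p.c", "mm/q.c"],
    "bug-4": ["kernel/m.c", "kernel/n.c"],
}
# Hypothetical ground truth: the single file fixed for each bug.
buggy_files = {
    "bug-1": "fs/a.c",
    "bug-2": "net/y.c",
    "bug-3": "mm/p.c",
    "bug-4": "drivers/z.c",
}
print(top_k_accuracy(rankings, buggy_files, k=1))
print(top_k_accuracy(rankings, buggy_files, k=2))
```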

VLMLight: Traffic Signal Control via Vision-Language Meta-Control and Dual-Branch Reasoning

Abstract

arXiv:2505.19486v1 Announce Type: new Abstract: Traffic signal control (TSC) is a core challenge in urban mobility, where real-time decisions must balance efficiency and safety. Existing methods - ranging from rule-based heuristics to reinforcement learning (RL) - often struggle to generalize to complex, dynamic, and safety-critical scenarios. We introduce VLMLight, a novel TSC framework that integrates vision-language meta-control with dual-branch reasoning. At the core of VLMLight is the first image-based traffic simulator that enables multi-view visual perception at intersections, allowing policies to reason over rich cues such as vehicle type, motion, and spatial density. A large language model (LLM) serves as a safety-prioritized meta-controller, selecting between a fast RL policy for routine traffic and a structured reasoning branch for critical cases. In the latter, multiple LLM agents collaborate to assess traffic phases, prioritize emergency vehicles, and verify rule compliance. Experiments show that VLMLight reduces waiting times for emergency vehicles by up to 65% over RL-only systems, while preserving real-time performance in standard conditions with less than 1% degradation. VLMLight offers a scalable, interpretable, and safety-aware solution for next-generation traffic signal control.

摘要

交通信号控制（TSC）是城市交通中的核心挑战，其实时决策需兼顾效率与安全性。现有方法——从基于规则的启发式到强化学习（RL）——往往难以泛化至复杂、动态且安全至上的场景。本文提出VLMLight，一种融合视觉语言元控制与双分支推理的新型TSC框架。其核心是首个基于图像的交通模拟器，可实现交叉路口的全景视觉感知，使策略能解析车辆类型、运动状态及空间密度等丰富信息。大型语言模型（LLM）作为安全优先的元控制器，在常规交通的快速RL策略与关键场景的结构化推理分支间动态切换。后者通过多LLM智能体协作，评估交通相位、优先调度应急车辆并验证规则合规性。实验表明，相较于纯RL系统，VLMLight将应急车辆等待时间缩短达65%，同时在标准条件下保持实时性能（延迟率低于1%）。该框架为下一代交通信号控制提供了可扩展、可解释且安全感知的解决方案。
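
The routing role of the meta-controller can be shown with a small Python sketch. In VLMLight the meta-controller is an LLM working from visual input; here it is replaced by a hand-written rule, and the observation fields (`queues`, `emergency_approach`) and both branch functions are hypothetical stand-ins.

```python
def rl_policy(obs):
    """Stand-in for the fast RL branch: serve the approach with the longest queue."""
    return max(obs["queues"], key=obs["queues"].get)

def reasoning_branch(obs):
    """Stand-in for the deliberate branch: serve the emergency vehicle's approach."""
    return obs["emergency_approach"]

def meta_controller(obs):
    """Route critical scenes to the reasoning branch, routine ones to the RL policy."""
    if obs.get("emergency_approach") is not None:
        return "reasoning", reasoning_branch(obs)
    return "rl", rl_policy(obs)

routine = {"queues": {"north": 3, "east": 7}, "emergency_approach": None}
critical = {"queues": {"north": 3, "east": 7}, "emergency_approach": "north"}
print(meta_controller(routine))
print(meta_controller(critical))
```

Keeping the slow branch off the routine path is what would let such a design preserve real-time behavior in standard conditions.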

Automated CAD Modeling Sequence Generation from Text Descriptions via Transformer-Based Large Language Models

Abstract

arXiv:2505.19490v1 Announce Type: new Abstract: Designing complex computer-aided design (CAD) models is often time-consuming due to challenges such as computational inefficiency and the difficulty of generating precise models. We propose a novel language-guided framework for industrial design automation to address these issues, integrating large language models (LLMs) with computer-automated design (CAutoD). Through this framework, CAD models are automatically generated from parameters and appearance descriptions, supporting the automation of design tasks during the detailed CAD design phase. Our approach introduces three key innovations: (1) a semi-automated data annotation pipeline that leverages LLMs and vision-language large models (VLLMs) to generate high-quality parameters and appearance descriptions; (2) a Transformer-based CAD generator (TCADGen) that predicts modeling sequences via dual-channel feature aggregation; (3) an enhanced CAD modeling generation model, called CADLLM, that is designed to refine the generated sequences by incorporating the confidence scores from TCADGen. Experimental results demonstrate that the proposed approach outperforms traditional methods in both accuracy and efficiency, providing a powerful tool for automating industrial workflows and generating complex CAD models from textual prompts. The code is available at https://jianxliao.github.io/cadllm-page/

摘要

设计复杂的计算机辅助设计（CAD）模型常因计算效率低下和生成精确模型的困难而耗时。为解决这些问题，我们提出了一种新颖的语言引导工业设计自动化框架，将大语言模型（LLMs）与计算机自动化设计（CAutoD）相结合。该框架通过参数和外观描述自动生成CAD模型，支持详细CAD设计阶段的任务自动化。我们的方法包含三项关键创新：（1）利用LLMs和视觉语言大模型（VLLMs）生成高质量参数与外观描述的半自动数据标注流程；（2）基于Transformer的双通道特征聚合CAD生成器（TCADGen），用于预测建模序列；（3）改进的CAD建模生成模型CADLLM，通过整合TCADGen的置信度分数优化生成序列。实验结果表明，所提方法在精度和效率上均优于传统方法，为工业流程自动化及文本提示生成复杂CAD模型提供了强大工具。代码详见https://jianxliao.github.io/cadllm-page/

BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs

Abstract

arXiv:2505.19457v1 Announce Type: new Abstract: Large language models excel in general tasks, yet assessing their reliability in logic-heavy, precision-critical domains like finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, spanning five dimensions: numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering, grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We benchmark 25 models, including both proprietary and open-source systems. Extensive experiments show that no model dominates across all tasks. Our evaluation reveals distinct capability patterns: (1) In Numerical Calculation, Claude-3.5-Sonnet (63.18) and DeepSeek-R1 (64.04) lead, while smaller models like Qwen2.5-VL-3B (15.92) lag significantly; (2) In Reasoning, proprietary models dominate (ChatGPT-o3: 83.58, Gemini-2.0-Flash: 81.15), with open-source models trailing by up to 19.49 points; (3) In Information Extraction, the performance spread is the largest, with DeepSeek-R1 scoring 71.46, while Qwen3-1.7B scores 11.23; (4) In Prediction Recognition, performance variance is minimal, with top models scoring between 39.16 and 50.00. We find that while current LLMs handle routine finance queries competently, they struggle with complex scenarios requiring cross-concept reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future research. The code and dataset are available at https://github.com/HiThink-Research/BizFinBench.

摘要

大语言模型在通用任务中表现卓越，但评估其在金融、法律和医疗等逻辑密集、精度至关键领域的可靠性仍具挑战性。为此，我们推出BizFinBench——首个专为评估大语言模型在真实金融场景中应用性能而设计的基准测试。该基准包含6,781条经过精细标注的中文查询，涵盖数值计算、逻辑推理、信息抽取、预测识别和知识问答五个维度，细分为九个子类别，同时采用客观与主观双重评估指标。我们还提出IteraJudge这一创新的大语言模型评估方法，可有效降低模型作为评估者时的客观指标偏差。我们对25个专有及开源模型进行了全面测试，实验表明没有任何模型能在所有任务中占据优势。评估结果揭示了显著的能力分化：(1)数值计算任务中Claude-3.5-Sonnet(63.18)与DeepSeek-R1(64.04)领先，而Qwen2.5-VL-3B(15.92)等小模型表现欠佳；(2)逻辑推理领域由专有模型主导(ChatGPT-o3:83.58，Gemini-2.0-Flash:81.15)，开源模型最大落后19.49分；(3)信息抽取任务性能差异最为显著，DeepSeek-R1达71.46分，Qwen3-1.7B仅11.23分；(4)预测识别任务中各模型表现趋同，最优模型得分介于39.16至50.00之间。研究发现，当前大语言模型虽能胜任常规金融查询，但在需要跨概念推理的复杂场景中仍存在局限。BizFinBench为未来研究提供了严格贴合商业实践的评估基准，代码与数据集已开源于https://github.com/HiThink-Research/BizFinBench。

Customising Electricity Contracts at Scale with Large Language Models

Abstract

arXiv:2505.19551v1 Announce Type: new Abstract: The electricity system is becoming more complex, connecting massive numbers of end-users and distributed generators. Adding or removing grid connections requires expert studies to align technical constraints with user requests. In times of labour shortages, carrying out these studies represents a significant amount of time that engineers at system operators spend in planning departments. As time is limited, only standard block connectivity contracts can be offered to end-users, or the requests pile up. Even if offers are made, these often do not perfectly match the user's requirements, leading to overpaying or underusing the grid capacity. This paper investigates whether end-users can negotiate individual, flexible time-of-use contracts directly with the grid using Large Language Models (LLM) in chats at scale. The LLM-based chat has direct access to a model of the grid and studies the grid's technical constraints just as an expert engineer would. The advantage of this system is that end-users can directly interact with grid models through natural language; no intermediary is needed to service, analyse, study, assess, advise, consult and engineer. This initial study paves the way toward developing this tailored LLM system, resulting in possible high-efficiency gains for grid planning and customer management.

摘要

电力系统正变得日益复杂，需要连接海量终端用户和分布式发电设备。新增或移除电网连接需通过专家研究来协调技术约束与用户需求。在劳动力短缺时期，开展这些研究占据了系统运营商工程师在规划部门的大量工作时间。由于时间有限，运营商只能向终端用户提供标准区块连接合约，或导致需求积压。即便提供合约方案，也常无法完全匹配用户需求，造成电网容量过度付费或利用不足。本文研究终端用户能否通过大规模聊天交互，利用大型语言模型(LLM)直接与电网协商个性化的灵活分时用电合约。基于LLM的聊天系统可直接访问电网模型，并像专业工程师一样研究电网技术约束。该系统的优势在于终端用户可通过自然语言直接与电网模型交互，无需中间环节进行服务、分析、研究、评估、建议、咨询和工程设计。这项初步研究为开发定制化LLM系统奠定基础，有望为电网规划和客户管理带来显著效率提升。

Turing Test 2.0: The General Intelligence Threshold

Abstract

arXiv:2505.19550v1 Announce Type: new Abstract: With the rise of artificial intelligence (A.I.) and large language models like Chat-GPT, a new race for achieving artificial general intelligence (A.G.I.) has started. While many speculate how and when A.I. will achieve A.G.I., there is no clear agreement on how A.G.I. can be detected in A.I. models, even when popular tools like the Turing test (and its modern variations) are used to measure their intelligence. In this work, we discuss why traditional methods like the Turing test do not suffice for measuring or detecting A.G.I. and provide a new, practical method that can be used to decide if a (computer or any other) system has reached or surpassed A.G.I. To achieve this, we make two new contributions. First, we present a clear definition for general intelligence (G.I.) and set a G.I. threshold (G.I.T.) that can be used to distinguish between systems that achieve A.G.I. and systems that do not. Second, we present a new framework on how to construct tests that can detect if a system has achieved G.I. in a simple, comprehensive, and clear-cut fail/pass way. We call this novel framework the Turing Tests 2.0. We then demonstrate real-life examples of applying tests that follow our Turing Tests 2.0 framework on modern A.I. models.

摘要

随着人工智能（A.I.）和Chat-GPT等大型语言模型的兴起，一场关于实现人工通用智能（A.G.I.）的新竞赛已然展开。尽管众多研究者推测A.I.实现A.G.I.的方式与时间节点，但关于如何在A.I.模型中检测A.G.I.仍缺乏明确共识——即便使用图灵测试（及其现代变体）等流行工具来评估其智能水平。本研究论述了为何图灵测试等传统方法不足以衡量或检测A.G.I.，并提出了一种可实际用于判定（计算机或其他）系统是否达到或超越A.G.I.的新方法。为此，我们作出两项新贡献：首先提出通用智能（G.I.）的明确定义，并设立可用于区分是否达成A.G.I.的通用智能阈值（G.I.T.）；其次构建新型测试框架，以简单、全面且非黑即白的通过/失败方式检测系统是否实现G.I.。我们将这一创新框架命名为"图灵测试2.0"，并通过在现代A.I.模型上应用符合该框架测试的实际案例进行实证展示。

Automated Text-to-Table for Reasoning-Intensive Table QA: Pipeline Design and Benchmarking Insights

Abstract

arXiv:2505.19563v1 Announce Type: new Abstract: Reasoning with tabular data holds increasing importance in modern applications, yet comprehensive evaluation methodologies for reasoning-intensive Table Question Answering (QA) tasks remain nascent. Existing research is constrained by two primary bottlenecks: 1) Reliance on costly manually annotated real-world data, which struggles to cover complex reasoning scenarios; 2) The heterogeneity of table structures, which hinders systematic analysis of the intrinsic mechanisms behind the underperformance of LLMs, especially in reasoning-intensive tasks. To address these issues, we propose an automated generation pipeline AutoT2T that transforms mathematical word problems into table-based reasoning tasks, eliminating the need for manual annotation. The pipeline can generate multiple variants of a table for the same reasoning problem, including noisy versions to support robustness evaluation. Based on this, we construct a new benchmark TabularGSM, which systematically spans a range of table complexities and trap problems. Experimental analyses through AutoT2T and TabularGSM reveal that the tight coupling between reasoning and retrieval or identification processes is a key factor underlying the failure of LLMs in complex Table QA tasks. This highlights the necessity for models to develop synergistic reasoning capabilities in order to perform effectively in complex Table QA tasks.

摘要

在现代应用中，基于表格数据的推理日益重要，然而针对推理密集型表格问答（QA）任务的综合评估方法仍处于起步阶段。现有研究主要受限于两个瓶颈：1）依赖成本高昂的人工标注真实数据，难以覆盖复杂推理场景；2）表格结构的异质性阻碍了对大语言模型（LLMs）在推理密集型任务中表现不佳的内在机制进行系统分析。为解决这些问题，我们提出自动化生成流程AutoT2T，将数学文字问题转化为基于表格的推理任务，无需人工标注。该流程能针对同一推理问题生成包括支持鲁棒性评估的噪声版本在内的多种表格变体。基于此，我们构建了新基准TabularGSM，系统覆盖不同复杂度表格及陷阱问题。通过AutoT2T和TabularGSM的实验分析表明，推理与检索或识别过程的紧密耦合是LLMs在复杂表格QA任务中失败的关键因素，这凸显了模型需发展协同推理能力以有效应对复杂表格QA任务的必要性。

AMQA: An Adversarial Dataset for Benchmarking Bias of LLMs in Medicine and Healthcare

Abstract

arXiv:2505.19562v1 Announce Type: new Abstract: Large language models (LLMs) are reaching expert-level accuracy on medical diagnosis questions, yet their mistakes and the biases behind them pose life-critical risks. Bias linked to race, sex, and socioeconomic status is already well known, but a consistent and automatic testbed for measuring it is missing. To fill this gap, this paper presents AMQA -- an Adversarial Medical Question-Answering dataset -- built for automated, large-scale bias evaluation of LLMs in medical QA. AMQA includes 4,806 medical QA pairs sourced from the United States Medical Licensing Examination (USMLE) dataset, generated using a multi-agent framework to create diverse adversarial descriptions and question pairs. Using AMQA, we benchmark five representative LLMs and find surprisingly substantial disparities: even GPT-4.1, the least biased model tested, answers privileged-group questions over 10 percentage points more accurately than unprivileged ones. Compared with the existing benchmark CPV, AMQA reveals 15% larger accuracy gaps on average between privileged and unprivileged groups. Our dataset and code are publicly available at https://github.com/XY-Showing/AMQA to support reproducible research and advance trustworthy, bias-aware medical AI.

摘要

大型语言模型（LLMs）在医学诊断问题上已达到专家级准确度，但其错误及背后的偏见仍存在危及生命的风险。与种族、性别和社会经济地位相关的偏见已广为人知，但尚缺乏一致且自动化的测试平台来衡量这些偏见。为填补这一空白，本文提出AMQA——一个对抗性医学问答数据集——专为医学QA中LLMs的自动化、大规模偏见评估而构建。AMQA包含4,806个医学问答对，源自美国医师执照考试（USMLE）数据集，通过多智能体框架生成多样化的对抗性描述和问题对。利用AMQA，我们对五种代表性LLMs进行基准测试，发现存在惊人的显著差异：即使是被测试模型中偏见最少的GPT-4.1，其对特权群体问题的回答准确率仍比非特权群体高出10个百分点以上。与现有基准CPV相比，AMQA揭示的特权与非特权群体间准确率差距平均扩大15%。我们的数据集和代码已公开于https://github.com/XY-Showing/AMQA，以支持可重复研究并推动可信赖、具有偏见意识的医疗AI发展。

Think Again! The Effect of Test-Time Compute on Preferences, Opinions, and Beliefs of Large Language Models

Abstract

arXiv:2505.19621v1 Announce Type: new Abstract: As Large Language Models (LLMs) become deeply integrated into human life and increasingly influence decision-making, it's crucial to evaluate whether and to what extent they exhibit subjective preferences, opinions, and beliefs. These tendencies may stem from biases within the models, which may shape their behavior, influence the advice and recommendations they offer to users, and potentially reinforce certain viewpoints. This paper presents the Preference, Opinion, and Belief survey (POBs), a benchmark developed to assess LLMs' subjective inclinations across societal, cultural, ethical, and personal domains. We applied our benchmark to evaluate leading open- and closed-source LLMs, measuring desired properties such as reliability, neutrality, and consistency. In addition, we investigated the effect of increasing the test-time compute, through reasoning and self-reflection mechanisms, on those metrics. While effective in other tasks, our results show that these mechanisms offer only limited gains in our domain. Furthermore, we reveal that newer model versions are becoming less consistent and more biased toward specific viewpoints, highlighting a blind spot and a concerning trend. POBS: https://ibm.github.io/POBS

摘要

随着大型语言模型（LLMs）深度融入人类生活并日益影响决策过程，评估其是否及在何种程度上表现出主观偏好、观点和信念变得至关重要。这些倾向可能源于模型内部的偏见，进而塑造其行为、影响向用户提供的建议与推荐，并可能强化特定观点。本文提出'偏好、观点与信念调查'（POBs）基准，该基准用于评估LLMs在社会、文化、伦理及个人领域的倾向性。我们运用该基准评估了领先的开源与闭源LLMs，测量了可靠性、中立性和一致性等关键属性。此外，我们通过推理与自省机制研究了增加测试时计算资源对这些指标的影响。结果显示，尽管这些机制在其他任务中有效，但在本领域仅能带来有限提升。进一步研究发现，新版模型正变得愈发不一致，且更倾向于特定观点，这揭示了当前研究的盲点及令人担忧的发展趋势。POBS项目地址：https://ibm.github.io/POBS

LLM-Agent-Controller: A Universal Multi-Agent Large Language Model System as a Control Engineer

Abstract

arXiv:2505.19567v1 Announce Type: new Abstract: This study presents the LLM-Agent-Controller, a multi-agent large language model (LLM) system developed to address a wide range of problems in control engineering (Control Theory). The system integrates a central controller agent with multiple specialized auxiliary agents, responsible for tasks such as controller design, model representation, control analysis, time-domain response, and simulation. A supervisor oversees high-level decision-making and workflow coordination, enhancing the system's reliability and efficiency. The LLM-Agent-Controller incorporates advanced capabilities, including Retrieval-Augmented Generation (RAG), Chain-of-Thought reasoning, self-criticism and correction, efficient memory handling, and user-friendly natural language communication. It is designed to function without requiring users to have prior knowledge of Control Theory, enabling them to input problems in plain language and receive complete, real-time solutions. To evaluate the system, we propose new performance metrics assessing both individual agents and the system as a whole. We test five categories of Control Theory problems and benchmark performance across three advanced LLMs. Additionally, we conduct a comprehensive qualitative conversational analysis covering all key services. Results show that the LLM-Agent-Controller successfully solved 83% of general tasks, with individual agents achieving an average success rate of 87%. Performance improved with more advanced LLMs. This research demonstrates the potential of multi-agent LLM architectures to solve complex, domain-specific problems. By integrating specialized agents, supervisory control, and advanced reasoning, the LLM-Agent-Controller offers a scalable, robust, and accessible solution framework that can be extended to various technical domains.

摘要

本研究提出LLM-Agent-Controller——一个为解决控制工程（控制理论）领域广泛问题而开发的多智能体大语言模型系统。该系统将中央控制器智能体与多个专业辅助智能体相集成，分别负责控制器设计、模型表示、控制分析、时域响应及仿真等任务。监督器负责高层决策与工作流协调，从而提升系统的可靠性和效率。该架构融合了检索增强生成、思维链推理、自我批判与修正、高效记忆处理及用户友好的自然语言交互等先进功能，其设计使得用户无需具备控制理论背景知识，仅需用自然语言输入问题即可获得完整的实时解决方案。为评估系统性能，我们提出了同时评估单个智能体与整体系统的新指标，测试了五类控制理论问题并在三种先进大语言模型上进行基准比较，还对所有核心服务进行了全面的定性对话分析。结果表明，该系统成功解决了83%的常规任务，各智能体平均成功率达87%，且性能随大语言模型升级而提升。本研究证明了多智能体大语言模型架构在解决复杂领域特定问题方面的潜力，通过整合专业智能体、监督控制与高级推理能力，该框架提供了可扩展、鲁棒且易用的解决方案，可推广至多种技术领域。

Token-Importance Guided Direct Preference Optimization

Abstract

arXiv:2505.19653v1 Announce Type: new Abstract: Ensuring that large language models (LLMs) generate outputs aligned with human preferences is important for safe and effective AI interactions. Direct Preference Optimization (DPO) employs an implicit reward function to optimize the policy model; however, it and its related variants overlook the differential importance of individual tokens and are sensitive to judgment noise in preference datasets during generation. Although recent methods attempt to assess the importance weights of tokens via probability prediction or simplistic weighting schemes, these evaluation methods are prone to biases and still cannot fully address these issues. To solve this problem, we propose the Token-Importance Guided Direct Preference Optimization (TI-DPO), which introduces two key innovations: gradient-based token-importance weights that dynamically prioritize critical tokens, and a triple loss that explicitly guides model outputs to approach human-preferred responses and stay away from non-preferred responses. Experimental results show that TI-DPO achieves higher accuracy and stronger generative diversity, providing more stable and computationally efficient solutions compared with DPO and other RLHF methods.

摘要

确保大型语言模型（LLM）生成的输出符合人类偏好，对于实现安全有效的人工智能交互至关重要。虽然直接偏好优化（DPO）采用隐式奖励函数来优化策略模型，但其及相关变体方法忽视了单个令牌的差异性重要性，且在生成过程中对偏好数据集中的判断噪声较为敏感。尽管近期研究尝试通过概率预测或简单加权方案评估令牌的重要性权重，但这些评估方法容易产生偏差，仍无法完全解决上述问题。为此，我们提出令牌重要性引导的直接偏好优化（TI-DPO），其包含两项关键创新：基于梯度的动态令牌重要性权重机制——优先处理关键令牌，以及三重损失函数——显式引导模型输出接近人类偏好响应并远离非偏好响应。实验结果表明，与DPO及其他强化学习人类反馈（RLHF）方法相比，TI-DPO具有更高的准确性和更强的生成多样性，能提供更稳定且计算效率更优的解决方案。
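
The token-weighting idea can be illustrated with a minimal sketch. It assumes the standard DPO objective, -log sigmoid(beta * (margin_chosen - margin_rejected)), and simply scales each token's log-ratio by a weight before summing. The function names and the numbers are hypothetical, the weights are passed in by hand rather than derived from gradients, and the triple loss is not modeled, so this is not the TI-DPO implementation.

```python
import math

def neg_log_sigmoid(x):
    # -log(sigmoid(x)) = log(1 + exp(-x)), written to avoid overflow for large |x|
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

def sequence_margin(policy_logps, ref_logps, weights):
    # Weighted sum of per-token log-ratios: w_t * (log pi(y_t) - log pi_ref(y_t))
    return sum(w * (p - r) for w, p, r in zip(weights, policy_logps, ref_logps))

def weighted_dpo_loss(chosen, rejected, beta=0.1):
    # chosen / rejected: dicts with equal-length 'policy', 'ref', 'weights' lists
    m_w = sequence_margin(chosen["policy"], chosen["ref"], chosen["weights"])
    m_l = sequence_margin(rejected["policy"], rejected["ref"], rejected["weights"])
    return neg_log_sigmoid(beta * (m_w - m_l))

# Hypothetical per-token log-probabilities for a two-token response pair.
chosen = {"policy": [-1.0, -0.5], "ref": [-1.2, -1.0], "weights": [1.0, 1.0]}
rejected = {"policy": [-2.0, -1.0], "ref": [-1.5, -1.0], "weights": [1.0, 1.0]}
uniform_loss = weighted_dpo_loss(chosen, rejected)

# Up-weighting the token where the policy improves most over the reference
# enlarges the margin and therefore lowers the loss.
reweighted_loss = weighted_dpo_loss(dict(chosen, weights=[1.0, 2.0]), rejected)
```

With all weights set to 1 the function reduces to the ordinary sequence-level DPO loss, which makes it easy to compare against an unweighted baseline.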

MSD-LLM: Predicting Ship Detention in Port State Control Inspections with Large Language Model

Abstract

arXiv:2505.19568v1 Announce Type: new Abstract: Maritime transportation is the backbone of global trade, making ship inspection essential for ensuring maritime safety and environmental protection. Port State Control (PSC), conducted by national ports, enforces compliance with safety regulations, with ship detention being the most severe consequence, impacting both ship schedules and company reputations. Traditional machine learning methods for ship detention prediction are limited by the capacity of representation learning and thus suffer from low accuracy. Meanwhile, autoencoder-based deep learning approaches face challenges due to the severe data imbalance in learning historical PSC detention records. To address these limitations, we propose Maritime Ship Detention with Large Language Models (MSD-LLM), integrating a dual robust subspace recovery (DSR) layer-based autoencoder with a progressive learning pipeline to handle imbalanced data and extract meaningful PSC representations. Then, a large language model groups and ranks features to identify likely detention cases, enabling dynamic thresholding for flexible detention predictions. Extensive evaluations on 31,707 PSC inspection records from the Asia-Pacific region show that MSD-LLM outperforms state-of-the-art methods by more than 12% on Area Under the Curve (AUC) for Singapore ports. Additionally, it demonstrates robustness to real-world challenges, making it adaptable to diverse maritime risk assessment scenarios.

摘要

海事运输是全球贸易的支柱，船舶检查对保障海上安全和环境保护至关重要。港口国监督（PSC）作为各国港口实施的监管机制，通过强制遵守安全法规来确保航行安全，其中船舶滞留是最严厉的处罚措施，会对船舶调度和公司声誉造成重大影响。传统机器学习方法因表征学习能力有限，导致船舶滞留预测准确率较低；而基于自动编码器的深度学习方法则因港口国监督滞留记录存在严重数据不平衡问题面临挑战。为突破这些局限，我们提出基于大语言模型的海事船舶滞留预测框架（MSD-LLM），通过集成双鲁棒子空间恢复层的自动编码器与渐进式学习流程，有效处理不平衡数据并提取有意义的港口国监督特征表示。随后利用大语言模型对特征进行分组排序以识别潜在滞留案例，并通过动态阈值实现灵活的滞留预测。基于亚太地区31,707条港口国监督检查记录的实验表明，该框架在新加坡港口的曲线下面积（AUC）指标上以超过12%的优势优于现有最优方法，同时展现出对实际应用挑战的强鲁棒性，可适应多样化海事风险评估场景。
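
For readers unfamiliar with the reported metric, the sketch below computes AUC directly from its pairwise definition: the fraction of (detained, not detained) pairs in which the detained ship receives the higher score, with ties counted as one half. The scores and labels are made-up illustration data, not values from the paper, and the function name is hypothetical.

```python
def auc(scores, labels):
    # Split scores by class: label 1 = detained, label 0 = not detained.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive-over-negative wins across all pairs; ties count as 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative data: two detained ships (scores 0.9, 0.3), two released (0.8, 0.1).
example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

Because AUC depends only on the ranking of scores, it is insensitive to the decision threshold, which is one reason it suits heavily imbalanced data such as detention records.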

Large Language Models' Reasoning Stalls: An Investigation into the Capabilities of Frontier Models

Abstract

arXiv:2505.19676v1 Announce Type: new Abstract: We study empirical methods for examining the capability of Large Language Models (LLMs) to use Automated Theorem Prover (ATP) reasoning strategies. We evaluate the performance of state-of-the-art models from December 2023 and August 2024 on PRONTOQA steamroller reasoning problems. For that, we develop methods for assessing LLM response accuracy and correct answer correlation. Our results show that progress in improving LLM reasoning abilities has stalled over the nine-month period. By tracking completion tokens, we show that almost all improvement in reasoning ability since GPT-4 was released can be attributed to either hidden system prompts or the training of models to automatically use generic Chain of Thought prompting strategies. Among the ATP reasoning strategies tried, we found that current frontier LLMs are best able to follow the bottom-up (also known as forward-chaining) strategy. A low positive correlation was found between an LLM response containing correct reasoning and arriving at the correct conclusion.

摘要

本文研究了评估大语言模型(LLM)运用自动定理证明器(ATP)推理策略能力的实证方法。我们评估了2023年12月至2024年8月期间最先进模型在PRONTOQA steamroller推理问题上的表现。为此，我们开发了评估LLM响应准确性和正确答案相关性的方法。研究结果表明，在九个月期间，LLM推理能力的提升进展陷入停滞。通过追踪完成标记，我们发现自GPT-4发布以来，几乎所有推理能力的提升都可归因于隐藏系统提示或训练模型自动使用通用思维链提示策略。在尝试的ATP推理策略中，当前前沿LLM最擅长遵循自底向上(又称前向链接)策略。研究发现，LLM响应包含正确推理与得出正确结论之间存在较低的正相关性。
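
The bottom-up (forward-chaining) strategy mentioned above can be shown with a minimal propositional sketch: starting from known facts, fire every rule whose premises are all satisfied, and repeat until nothing new can be derived. The facts and rules are invented for illustration and do not follow the PRONTOQA problem format.

```python
def forward_chain(facts, rules):
    # rules: list of (premises, conclusion), where premises is a set of facts.
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # Fire a rule only when every premise is already known.
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                changed = True
    return known

# Hypothetical toy knowledge base.
rules = [
    ({"wolf(w)"}, "animal(w)"),
    ({"animal(w)", "eats_meat(w)"}, "carnivore(w)"),
    ({"carnivore(w)", "hungry(w)"}, "hunts(w)"),
]
derived = forward_chain({"wolf(w)", "eats_meat(w)"}, rules)
```

Here `carnivore(w)` is derived in two steps, while `hunts(w)` is not, because `hungry(w)` was never asserted. A top-down (backward-chaining) prover would instead start from the goal and search for rules that conclude it.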

FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks

Abstract

arXiv:2505.19662v1 Announce Type: new Abstract: This paper proposes FieldWorkArena, a benchmark for agentic AI targeting real-world field work. With the recent increase in demand for agentic AI, such systems are required to monitor and report safety and health incidents, as well as manufacturing-related incidents, that may occur in real-world work environments. Existing agentic AI benchmarks have been limited to evaluating web tasks and are insufficient for evaluating agents in real-world work environments, where complexity increases significantly. In this paper, we define a new action space that agentic AI should possess for real-world work environment benchmarks and improve the evaluation function from previous methods to assess the performance of agentic AI in diverse real-world tasks. The dataset consists of videos captured on-site and documents actually used in factories and warehouses, and tasks were created based on interviews with on-site workers and managers. Evaluation results confirmed that performance evaluation considering the characteristics of Multimodal LLMs (MLLMs) such as GPT-4o is feasible. Additionally, the effectiveness and limitations of the proposed new evaluation method were identified. The complete dataset (HuggingFace) and evaluation program (GitHub) can be downloaded from the following website: https://en-documents.research.global.fujitsu.com/fieldworkarena/.

摘要

本文提出FieldWorkArena基准，旨在针对现实世界现场工作的代理人工智能进行评估。随着近期对代理AI需求的增长，这些系统需要监测并报告现实工作环境中可能发生的安全健康事件及制造相关事故。现有代理AI基准仅限于评估网络任务，无法充分评估在复杂度显著提升的现实工作环境中的代理性能。本研究定义了代理AI在现实工作环境基准中应具备的新动作空间，并改进了先前方法的评估函数，以评估代理AI在多样化现实任务中的表现。数据集由现场拍摄视频及工厂仓库实际使用文档构成，任务设计基于对现场工人和管理者的访谈。评估结果证实，考虑GPT-4o等多模态大语言模型（MLLM）特性的性能评估具有可行性。同时明确了所提新评估方法的有效性和局限性。完整数据集（HuggingFace）和评估程序（GitHub）可从以下网站下载：https://en-documents.research.global.fujitsu.com/fieldworkarena/。

Large Language Models for Planning: A Comprehensive and Systematic Survey

Abstract

arXiv:2505.19683v1 Announce Type: new Abstract: Planning represents a fundamental capability of intelligent agents, requiring comprehensive environmental understanding, rigorous logical reasoning, and effective sequential decision-making. While Large Language Models (LLMs) have demonstrated remarkable performance on certain planning tasks, their broader application in this domain warrants systematic investigation. This paper presents a comprehensive review of LLM-based planning. Specifically, this survey is structured as follows: First, we establish the theoretical foundations by introducing essential definitions and categories about automated planning. Next, we provide a detailed taxonomy and analysis of contemporary LLM-based planning methodologies, categorizing them into three principal approaches: 1) External Module Augmented Methods that combine LLMs with additional components for planning, 2) Finetuning-based Methods that involve using trajectory data and feedback signals to adjust LLMs in order to improve their planning abilities, and 3) Searching-based Methods that break down complex tasks into simpler components, navigate the planning space, or enhance decoding strategies to find the best solutions. Subsequently, we systematically summarize existing evaluation frameworks, including benchmark datasets, evaluation metrics and performance comparisons between representative planning methods. Finally, we discuss the underlying mechanisms enabling LLM-based planning and outline promising research directions for this rapidly evolving field. We hope this survey will serve as a valuable resource to inspire innovation and drive progress in this field.

摘要

规划是智能体的核心能力，需要综合的环境理解、严谨的逻辑推理和有效的序列决策。尽管大语言模型（LLMs）在某些规划任务中表现出卓越性能，但其在该领域的广泛应用仍需系统研究。本文对基于LLM的规划方法进行了全面综述：首先通过介绍自动化规划的基本定义与分类建立理论基础；其次详细梳理了当前基于LLM的规划方法学，将其归纳为三大类——1）外部模块增强法：通过结合附加组件与LLMs协同规划，2）微调法：利用轨迹数据与反馈信号调整LLMs以提升规划能力，3）搜索法：将复杂任务分解为简单组件、遍历规划空间或优化解码策略以寻求最优解；随后系统总结了现有评估框架，包括基准数据集、评价指标及代表性规划方法的性能对比；最后探讨了LLM实现规划的内在机制，并展望了这一快速发展领域的潜在研究方向。本综述旨在为该领域的创新研究提供有价值的参考，推动相关技术进步。

ReChisel: Effective Automatic Chisel Code Generation by LLM with Reflection

Abstract

arXiv:2505.19734v1 Announce Type: new Abstract: Coding with hardware description languages (HDLs) such as Verilog is a time-intensive and laborious task. With the rapid advancement of large language models (LLMs), there is increasing interest in applying LLMs to assist with HDL coding. Recent efforts have demonstrated the potential of LLMs in translating natural language to traditional HDL Verilog. Chisel, a next-generation HDL based on Scala, introduces higher-level abstractions, facilitating more concise, maintainable, and scalable hardware designs. However, the potential of using LLMs for Chisel code generation remains largely unexplored. This work proposes ReChisel, an LLM-based agentic system designed to enhance the effectiveness of Chisel code generation. ReChisel incorporates a reflection mechanism to iteratively refine the quality of generated code using feedback from compilation and simulation processes, and introduces an escape mechanism to break free from non-progress loops. Experiments demonstrate that ReChisel significantly improves the success rate of Chisel code generation, achieving performance comparable to state-of-the-art LLM-based agentic systems for Verilog code generation.

摘要

使用Verilog等硬件描述语言（HDL）进行编码是一项耗时且繁琐的任务。随着大语言模型（LLM）的快速发展，人们越来越关注如何应用LLM辅助HDL编码。近期研究表明，LLM在将自然语言转换为传统HDL Verilog方面具有潜力。Chisel作为基于Scala的下一代HDL，引入了更高层次的抽象，有助于实现更简洁、可维护和可扩展的硬件设计。然而，利用LLM生成Chisel代码的潜力尚未得到充分探索。本研究提出ReChisel，一个基于LLM的代理系统，旨在提升Chisel代码生成的效率。ReChisel通过集成反射机制，利用编译和仿真过程的反馈迭代优化生成代码质量，并引入逃逸机制以跳出非进展循环。实验表明，ReChisel显著提高了Chisel代码生成的成功率，其性能可与最先进的基于LLM的Verilog代码生成代理系统相媲美。
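
The reflection-plus-escape idea can be sketched as a generic control loop. Everything here is an assumption for illustration: `generate` and `check` are hypothetical callables standing in for the LLM and for the compile/simulate step, and the escape rule (drop the feedback after the same error repeats `patience` times) is one plausible reading of "breaking out of non-progress loops", not the mechanism the authors implement.

```python
def reflect_loop(generate, check, spec, max_iters=10, patience=2):
    # generate(spec, feedback) -> candidate code (hypothetical LLM call)
    # check(code) -> None on success, otherwise an error message
    feedback, last_error, stalled = None, None, 0
    for step in range(1, max_iters + 1):
        code = generate(spec, feedback)
        error = check(code)
        if error is None:
            return code, step
        # Count consecutive repeats of the same error as "no progress".
        stalled = stalled + 1 if error == last_error else 0
        last_error = error
        if stalled >= patience:
            # Escape: discard the feedback and regenerate from the bare spec.
            feedback, last_error, stalled = None, None, 0
        else:
            feedback = error
    return None, max_iters
```

A caller would wrap a real compiler invocation in `check` and a model call in `generate`; the loop itself is independent of Chisel and returns `(None, max_iters)` when no candidate passes.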

Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models

Abstract

arXiv:2505.19690v1 Announce Type: new Abstract: Despite the remarkable proficiency of Large Reasoning Models (LRMs) in handling complex reasoning tasks, their reliability in safety-critical scenarios remains uncertain. Existing evaluations primarily assess response-level safety, neglecting a critical issue we identify as Superficial Safety Alignment (SSA) -- a phenomenon where models produce superficially safe outputs while internal reasoning processes fail to genuinely detect and mitigate underlying risks, resulting in inconsistent safety behaviors across multiple sampling attempts. To systematically investigate SSA, we introduce Beyond Safe Answers (BSA) bench, a novel benchmark comprising 2,000 challenging instances organized into three distinct SSA scenario types and spanning nine risk categories, each meticulously annotated with risk rationales. Evaluations of 19 state-of-the-art LRMs demonstrate the difficulty of this benchmark, with top-performing models achieving only 38.0% accuracy in correctly identifying risk rationales. We further explore the efficacy of safety rules, specialized fine-tuning on safety reasoning data, and diverse decoding strategies in mitigating SSA. Our work provides a comprehensive assessment tool for evaluating and improving safety reasoning fidelity in LRMs, advancing the development of genuinely risk-aware and reliably safe AI systems.

摘要

尽管大型推理模型（LRMs）在处理复杂推理任务方面表现出卓越能力，但其在安全关键场景中的可靠性仍存在不确定性。现有评估主要关注响应层面的安全性，却忽视了我们发现的关键问题——表面安全对齐（SSA）。该现象表现为模型生成表面安全的输出，而其内部推理过程未能真正识别和缓解潜在风险，导致多次采样尝试中出现不一致的安全行为。为系统研究SSA，我们提出超越安全答案（BSA）基准，该新型基准包含2,000个挑战性实例，分为三种SSA场景类型，涵盖九大风险类别，每个实例均经过风险原理的精细标注。对19个最先进LRMs的评估表明该基准具有较高难度，表现最佳的模型在正确识别风险原理方面仅达到38.0%准确率。我们进一步探究了安全规则、针对安全推理数据的专项微调以及多样化解码策略在缓解SSA方面的有效性。本研究为评估和提升LRMs的安全推理保真度提供了全面评估工具，推动了真正具备风险意识且可靠安全的人工智能系统的发展。

SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond

Abstract

arXiv:2505.19641v1 Announce Type: new Abstract: Recent advances such as OpenAI-o1 and DeepSeek R1 have demonstrated the potential of Reinforcement Learning (RL) to enhance reasoning abilities in Large Language Models (LLMs). While open-source replication efforts have primarily focused on mathematical and coding domains, methods and resources for developing general reasoning capabilities remain underexplored. This gap is partly due to the challenge of collecting diverse and verifiable reasoning data suitable for RL. We hypothesize that logical reasoning is critical for developing general reasoning capabilities, as logic forms a fundamental building block of reasoning. In this work, we present SynLogic, a data synthesis framework and dataset that generates diverse logical reasoning data at scale, encompassing 35 diverse logical reasoning tasks. The SynLogic approach enables controlled synthesis of data with adjustable difficulty and quantity. Importantly, all examples can be verified by simple rules, making them ideally suited for RL with verifiable rewards. In our experiments, we validate the effectiveness of RL training on the SynLogic dataset based on 7B and 32B models. SynLogic leads to state-of-the-art logical reasoning performance among open-source datasets, surpassing DeepSeek-R1-Distill-Qwen-32B by 6 points on BBEH. Furthermore, mixing SynLogic data with mathematical and coding tasks improves the training efficiency of these domains and significantly enhances reasoning generalization. Notably, our mixed training model outperforms DeepSeek-R1-Zero-Qwen-32B across multiple benchmarks. These findings position SynLogic as a valuable resource for advancing the broader reasoning capabilities of LLMs. We open-source both the data synthesis pipeline and the SynLogic dataset at https://github.com/MiniMax-AI/SynLogic.

摘要

OpenAI-o1和DeepSeek R1等最新进展证明了强化学习（RL）在增强大语言模型（LLMs）推理能力方面的潜力。尽管开源复现工作主要集中在数学和编程领域，但开发通用推理能力的方法和资源仍未被充分探索。这一空白部分源于难以收集适合RL训练的多样化且可验证的推理数据。我们假设逻辑推理是发展通用推理能力的关键，因为逻辑构成推理的基础构建模块。本研究提出SynLogic——一个可规模化生成多样化逻辑推理数据的数据合成框架与数据集，涵盖35类不同的逻辑推理任务。SynLogic方法能按需调节数据难度与数量进行可控合成。重要的是，所有示例均可通过简单规则验证，使其特别适合搭配可验证奖励机制的RL训练。实验基于7B和32B模型验证了SynLogic数据集上RL训练的有效性：在开源数据集中，SynLogic实现了最先进的逻辑推理性能，在BBEH基准上以6分优势超越DeepSeek-R1-Distill-Qwen-32B。此外，将SynLogic数据与数学及编程任务混合训练，能提升这些领域的训练效率并显著增强推理泛化能力。值得注意的是，我们的混合训练模型在多个基准测试中全面优于DeepSeek-R1-Zero-Qwen-32B。这些发现使SynLogic成为推进LLMs广义推理能力的重要资源。我们已在https://github.com/MiniMax-AI/SynLogic开源数据合成管道与SynLogic数据集。

Divide and Conquer: Grounding LLMs as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning

Abstract

arXiv:2505.19761v1 Announce Type: new Abstract: While showing sophisticated reasoning abilities, large language models (LLMs) still struggle with long-horizon decision-making tasks due to deficient exploration and long-term credit assignment, especially in sparse-reward scenarios. Inspired by the divide-and-conquer principle, we propose an innovative framework GLIDER (Grounding Language Models as EffIcient Decision-Making Agents via Offline HiErarchical Reinforcement Learning) that introduces a parameter-efficient and generally applicable hierarchy to LLM policies. We develop a scheme where the low-level controller is supervised with abstract, step-by-step plans that are learned and instructed by the high-level policy. This design decomposes complicated problems into a series of coherent chain-of-thought reasoning sub-tasks, providing flexible temporal abstraction to significantly enhance exploration and learning for long-horizon tasks. Furthermore, GLIDER facilitates fast online adaptation to non-stationary environments owing to the strong transferability of its task-agnostic low-level skills. Experiments on ScienceWorld and ALFWorld benchmarks show that GLIDER achieves consistent performance gains, along with enhanced generalization capabilities.

摘要

尽管大型语言模型（LLMs）展现出复杂的推理能力，但由于探索不足和长期信用分配问题，其在长时程决策任务中仍存在困难，尤其在稀疏奖励场景下。受分治原则启发，我们提出创新框架GLIDER（Grounding Language Models as EffIcient Decision-Making Agents via Offline HiErarchical Reinforcement Learning），该框架为LLM策略引入了一种参数高效且普遍适用的层次结构。我们设计了一种方案，其中低级控制器通过高级策略学习并指导的抽象分步计划进行监督。该设计将复杂问题分解为一系列连贯的思维链推理子任务，通过灵活的时间抽象显著增强长时程任务的探索与学习能力。此外，得益于其任务无关低级技能的强可迁移性，GLIDER能够快速在线适应非平稳环境。在ScienceWorld和ALFWorld基准测试上的实验表明，GLIDER实现了持续的性能提升，并展现出更强的泛化能力。

Concise Reasoning, Big Gains: Pruning Long Reasoning Trace with Difficulty-Aware Prompting

Abstract

arXiv:2505.19716v1 Announce Type: new Abstract: Existing chain-of-thought (CoT) distillation methods can effectively transfer reasoning abilities to base models but suffer from two major limitations: excessive verbosity of reasoning traces and inadequate adaptability to problem difficulty. Long reasoning traces significantly increase inference costs, and uniform-length solutions prevent base models from learning adaptive reasoning strategies. To address these issues, we propose a difficulty-aware prompting (DAP) method to dynamically shorten reasoning traces without performance loss. In our approach, a large teacher model first judges each problem's difficulty and then rewrites its reasoning traces to an appropriate shorter length, yielding concise yet complete reasoning traces. Leveraging the DAP pipeline, we curate a distilled dataset called LiteCoT consisting of 100K concise reasoning examples, with solutions averaging only 720 tokens (an order of magnitude shorter than typical CoTs). Using LiteCoT, we distilled a new family of reasoning models called Liter (1.5B, 7B, and 32B) based on the Qwen2.5 architecture. Experiments show that a student model fine-tuned on just 100K of these difficulty-pruned CoT samples outperforms a model distilled on 800K original Long CoT samples, while significantly reducing training and inference costs. Our method also generalizes well: across 11 diverse benchmarks, the shorter difficulty-aware CoTs achieve equal or better accuracy than Long chains, using far fewer tokens. For example, on the challenging AIME24 exam, our approach reaches 74.2% Pass@1 using only about 5K inference tokens, surpassing other methods that consume many more tokens. Our code and data are available at https://github.com/Evanwu1125/LiteCoT.

摘要

现有思维链（CoT）蒸馏方法能有效将推理能力迁移至基础模型，但存在两大局限：推理轨迹过于冗长及对问题难度适应性不足。冗长的推理轨迹显著增加推理成本，而统一长度的解决方案阻碍基础模型学习自适应推理策略。为解决这些问题，我们提出难度感知提示（DAP）方法，在不损失性能的前提下动态缩短推理轨迹。该方法首先由大型教师模型判断问题难度，随后将其推理轨迹改写为适当缩短的长度，从而生成简洁完整的推理轨迹。基于DAP流程，我们构建了包含10万条精简推理样本的LiteCoT蒸馏数据集，其解决方案平均仅720个token（比典型CoT缩短一个数量级）。使用LiteCoT数据集，我们在Qwen2.5架构上蒸馏出新型推理模型系列Liter（1.5B/7B/32B）。实验表明，仅用10万条经难度筛选的CoT样本微调的学生模型，其性能优于基于80万条原始长CoT样本蒸馏的模型，同时显著降低训练和推理成本。该方法泛化性良好：在11个多样化基准测试中，较短的难度感知CoT使用更少token即可达到与长链相同或更高的准确率。例如在AIME24高难度考试中，我们的方法仅消耗约5K推理token即达到74.2%的Pass@1，优于其他消耗更多token的方法。代码与数据详见https://github.com/Evanwu1125/LiteCoT。

FinLoRA: Benchmarking LoRA Methods for Fine-Tuning LLMs on Financial Datasets

Abstract

arXiv:2505.19819v1 Announce Type: new Abstract: Low-rank adaptation (LoRA) methods show great potential for scaling pre-trained general-purpose Large Language Models (LLMs) to hundreds or thousands of use scenarios. However, their efficacy in high-stakes domains like finance, e.g., passing CFA exams and analyzing SEC filings, is rarely explored. In this paper, we present the open-source FinLoRA project that benchmarks LoRA methods on both general and highly professional financial tasks. First, we curated 19 datasets covering diverse financial applications; in particular, we created four novel XBRL analysis datasets based on 150 SEC filings. Second, we evaluated five LoRA methods and five base LLMs. Finally, we provide extensive experimental results in terms of accuracy, F1, and BERTScore and report computational cost in terms of time and GPU memory during fine-tuning and inference stages. We find that LoRA methods achieved substantial performance gains of 36% on average over base models. Our FinLoRA project provides an affordable and scalable approach to democratize financial intelligence to the general public. Datasets, LoRA adapters, code, and documentation are available at https://github.com/Open-Finance-Lab/FinLoRA

摘要

低秩自适应（LoRA）方法在将预训练的通用大语言模型（LLM）扩展至数百甚至数千种应用场景方面展现出巨大潜力。然而，其在金融等高风险领域的有效性鲜少被探索，例如通过CFA考试和分析美国证券交易委员会（SEC）文件。本文提出开源项目FinLoRA，对LoRA方法在通用及高度专业化金融任务上的表现进行基准测试。首先，我们整理了涵盖多样化金融应用的19个数据集；特别地，基于150份SEC文件创建了四个新颖的XBRL分析数据集。其次，我们评估了五种LoRA方法和五种基础LLM。最后，我们从准确率、F1值和BERTScore等维度提供了大量实验结果，并报告了微调与推理阶段的时间及GPU内存计算成本。研究发现，LoRA方法相较基础模型平均实现了36%的性能提升。FinLoRA项目为公众提供了一种经济、可扩展的金融智能化普及方案。数据集、LoRA适配器、代码及文档详见https://github.com/Open-Finance-Lab/FinLoRA。

Done Is Better than Perfect: Unlocking Efficient Reasoning by Structured Multi-Turn Decomposition

Abstract

arXiv:2505.19788v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) are criticized for the excessively lengthy Chain-of-Thought (CoT) they need to derive the final answer, suffering from high first-token and overall latency. Typically, the CoT of LRMs mixes multiple thinking units; each unit attempts to produce a candidate answer to the original query. Hence, a natural idea to improve efficiency is to reduce the number of units. Yet, the fact that the thinking units in vanilla CoT cannot be explicitly managed makes doing so challenging. This paper introduces Multi-Turn Decomposition (MinD) to decode conventional CoT into a sequence of explicit, structured, and turn-wise interactions to bridge the gap. In MinD, the model provides a multi-turn response to the query, where each turn embraces a thinking unit and yields a corresponding answer. The subsequent turns can reflect, verify, revise, or explore alternative approaches to both the thinking and answer parts of earlier ones. This not only delivers the answer more swiftly, but also enables explicit control over the iterative reasoning process (i.e., users may halt or continue at any turn). We follow a supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm to realize MinD. We first rephrase the outputs of an LRM into multi-turn formats by prompting another LLM, and then tune the LRM with such data. Observing that the tuned model tends to consume even more tokens than the original one (probably because the multi-turn formats introduce additional answer tokens), we advocate leveraging RL algorithms like GRPO to prioritize correct outputs with fewer turns. Trained on the MATH dataset using R1-Distill models, MinD can achieve up to ~70% reduction in both output token usage and time to first token (TTFT), while maintaining competitive performance on reasoning benchmarks such as MATH-500, AIME24, AMC23, and GPQA-Diamond.

摘要

大型推理模型（LRMs）因生成最终答案时需要过长的思维链（CoT）而受到批评，存在首词延迟和总体延迟过高的问题。通常，LRMs的CoT混合了多个思维单元，每个单元试图为原始查询生成一个候选答案。因此，提高效率的自然思路是减少思维单元数量。然而，传统CoT中的思维单元无法显式管理，使得这一目标难以实现。本文提出多轮分解（MinD）方法，将传统CoT解码为一系列显式、结构化、轮次化的交互以弥合这一差距。在MinD中，模型对查询提供多轮响应，每轮包含一个思维单元并生成相应答案。后续轮次可对先前轮次的思维部分和答案部分进行反思、验证、修正或探索替代方案。这不仅使答案更快呈现，还能实现对迭代推理过程的显式控制（用户可在任意轮次停止或继续）。我们采用监督微调（SFT）结合强化学习（RL）的范式实现MinD：首先通过提示另一个LLM将LRM的输出重述为多轮格式，随后用此类数据微调LRM。发现微调后的模型倾向于消耗比原始模型更多的token（可能因多轮格式引入了额外答案token），我们主张采用GRPO等RL算法优先选择轮次更少的正确输出。在MATH数据集上使用R1-Distill模型训练的MinD，能在保持MATH-500、AIME24、AMC23和GPQA-Diamond等推理基准竞争力的同时，实现输出token使用量和首词时间（TTFT）最高约70%的降低。
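
How a GRPO-style objective might favor correct answers reached in fewer turns can be sketched as follows. The group-relative advantage (reward minus the group mean, divided by the group standard deviation) is the core of GRPO; the reward shaping with a per-turn penalty, the penalty value, and the sample data are assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev

def reward(correct, turns, turn_penalty=0.1):
    # Hypothetical shaping: 1 for a correct final answer, minus a cost per turn.
    return (1.0 if correct else 0.0) - turn_penalty * turns

def group_advantages(rewards, eps=1e-8):
    # Normalize each reward against the other samples drawn for the same query.
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one query: (is_correct, number_of_turns).
samples = [(True, 2), (True, 5), (False, 1), (False, 4)]
advantages = group_advantages([reward(c, t) for c, t in samples])
```

In this toy group the correct two-turn response receives the largest advantage and the incorrect four-turn response the smallest, so a policy-gradient update would shift probability toward short, correct trajectories. With this particular penalty a very long correct answer could score below a short wrong one, so the penalty scale would need tuning in practice.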

DGRAG: Distributed Graph-based Retrieval-Augmented Generation in Edge-Cloud Systems

Abstract

arXiv:2505.19847v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising approach to enhance the capabilities of language models by integrating external knowledge. Due to the diversity of data sources and the constraints of memory and computing resources, real-world data is often scattered in multiple devices. Conventional RAGs that store massive amounts of scattered data centrally face increasing privacy concerns and high computational costs. Additionally, RAG in a central node raises latency issues when searching over a large-scale knowledge base. To address these challenges, we propose a distributed Knowledge Graph-based RAG approach, referred to as DGRAG, in an edge-cloud system, where each edge device maintains a local knowledge base without the need to share it with the cloud, instead sharing only summaries of its knowledge. Specifically, DGRAG has two main phases. In the Distributed Knowledge Construction phase, DGRAG organizes local knowledge using knowledge graphs, generating subgraph summaries and storing them in a summary database in the cloud as information sharing. In the Collaborative Retrieval and Generation phase, DGRAG first performs knowledge retrieval and answer generation locally, and a gate mechanism determines whether the query is beyond the scope of local knowledge or processing capabilities. For queries that exceed the local knowledge scope, the cloud retrieves knowledge from the most relevant edges based on the summaries and generates a more precise answer. Experimental results demonstrate the effectiveness of the proposed DGRAG approach in significantly improving the quality of question-answering tasks over baseline approaches.

摘要

检索增强生成（RAG）作为一种通过整合外部知识来增强语言模型能力的方法，已展现出广阔前景。由于数据源的多样性与内存、计算资源的限制，现实世界中的数据往往分散存储于多个设备中。传统RAG方案集中存储海量分散数据，不仅面临日益严峻的隐私问题，还伴随高昂的计算成本。此外，在中央节点实施RAG时，大规模知识库检索会引发延迟问题。为应对这些挑战，我们提出一种基于分布式知识图的RAG方法（简称DGRAG），部署于边缘-云系统中。该方法中，每个边缘设备维护本地知识库而无需共享原始数据，仅通过知识摘要实现信息交互。具体而言，DGRAG包含两个核心阶段：在分布式知识构建阶段，系统利用知识图谱组织本地知识，生成子图摘要并存储于云端的摘要数据库以实现信息共享；在协同检索与生成阶段，DGRAG首先在本地执行知识检索与答案生成，并通过门控机制判断查询是否超出本地知识范围或处理能力。对于超出本地知识范围的查询，云端将根据摘要从最相关的边缘设备检索知识，并生成更精确的答案。实验结果表明，相较于基线方法，所提出的DGRAG方案能显著提升问答任务的质量。
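
The gate-then-escalate flow described above can be illustrated with a small sketch. It is not the DGRAG implementation: the word-overlap score, the `gate_threshold` value, and the function names `overlap` and `route_query` are all assumptions made for illustration, and the sketch gates on a summary match rather than on the locally generated answer.

```python
def overlap(a: str, b: str) -> float:
    """Jaccard word overlap, a stand-in for a real relevance score."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def route_query(query, local_summary, edge_summaries, gate_threshold=0.3, top_k=1):
    """Decide whether a query stays on the edge device or goes to the cloud.

    edge_summaries maps a hypothetical edge id to the summary text that
    the edge shared with the cloud; raw local knowledge is never shared.
    """
    # Gate: treat the query as in scope if it matches the local summary.
    if overlap(query, local_summary) >= gate_threshold:
        return ("local", [])
    # Otherwise the cloud ranks edges by summary relevance.
    ranked = sorted(
        edge_summaries,
        key=lambda edge_id: overlap(query, edge_summaries[edge_id]),
        reverse=True,
    )
    return ("cloud", ranked[:top_k])


edges = {
    "edge_a": "battery fault codes manual",
    "edge_b": "parking rules",
}
decision = route_query("battery fault codes", "cafeteria menu prices", edges)
```

Only summaries cross the device boundary in this sketch, which mirrors the privacy argument in the abstract; a real gate would presumably also consider the quality of the locally generated answer.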

HS-STAR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation

Abstract

arXiv:2505.19866v1 Announce Type: new Abstract: Self-taught reasoners (STaRs) enhance the mathematical reasoning abilities of large language models (LLMs) by leveraging self-generated responses for self-training. Recent studies have incorporated reward models to guide response selection or decoding, aiming to obtain higher-quality data. However, they typically allocate a uniform sampling budget across all problems, overlooking the varying utility of problems at different difficulty levels. In this work, we conduct an empirical study and find that problems near the boundary of the LLM's reasoning capability offer significantly greater learning utility than both easy and overly difficult ones. To identify and exploit such problems, we propose HS-STaR, a Hierarchical Sampling framework for Self-Taught Reasoners. Given a fixed sampling budget, HS-STaR first performs lightweight pre-sampling with a reward-guided difficulty estimation strategy to efficiently identify boundary-level problems. Subsequently, it dynamically reallocates the remaining budget toward these high-utility problems during a re-sampling phase, maximizing the generation of valuable training data. Extensive experiments across multiple reasoning benchmarks and backbone LLMs demonstrate that HS-STaR significantly outperforms other baselines without requiring additional sampling budget.

摘要

自学推理器（STaRs）通过利用自生成的响应进行自我训练，增强了大型语言模型（LLMs）的数学推理能力。近期研究引入奖励模型以指导响应选择或解码，旨在获取更高质量的数据。然而，这些方法通常对所有问题分配统一的采样预算，忽视了不同难度问题在效用上的差异。本研究通过实证分析发现，位于模型推理能力边界附近的问题，其学习效用显著高于简单或过度困难的问题。为识别并利用此类问题，我们提出HS-STaR框架——一种面向自学推理器的分层采样方法。在固定采样预算下，HS-STaR首先采用基于奖励的难度评估策略进行轻量级预采样，高效定位边界级问题；随后在重采样阶段动态将剩余预算重新分配给这些高效用问题，从而最大化有价值训练数据的生成。跨多个推理基准和骨干LLMs的广泛实验表明，HS-STaR在不增加采样预算的前提下，显著优于其他基线方法。
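
As a rough illustration of the two-phase budget split described above, the sketch below estimates difficulty from pre-sampling rewards and moves the remaining budget to boundary-level problems. The mean-reward difficulty estimate, the `low`/`high` band, and the name `reallocate_budget` are assumptions for illustration; the abstract does not specify the actual estimation strategy.

```python
def reallocate_budget(pre_rewards, total_budget, low=0.2, high=0.8):
    """Return extra samples per problem for the re-sampling phase.

    pre_rewards maps a problem id to the reward scores (in [0, 1]) of its
    pre-sampled responses. Problems whose mean reward falls inside
    [low, high] are treated as boundary-level (an assumed criterion).
    """
    used = sum(len(scores) for scores in pre_rewards.values())
    remaining = total_budget - used
    if remaining < 0:
        raise ValueError("pre-sampling already exceeded the total budget")

    boundary = [
        pid for pid, scores in pre_rewards.items()
        if low <= sum(scores) / len(scores) <= high
    ]
    extra = {pid: 0 for pid in pre_rewards}
    if not boundary:
        return extra  # nothing to prioritize in this sketch

    # Split the remaining budget as evenly as possible over boundary problems.
    share, leftover = divmod(remaining, len(boundary))
    for i, pid in enumerate(boundary):
        extra[pid] = share + (1 if i < leftover else 0)
    return extra


pre = {"easy": [1, 1], "hard": [0, 0], "edge": [1, 0]}
plan = reallocate_budget(pre, total_budget=12)
```

With two pre-samples per problem and a total budget of 12, the six remaining samples all go to the one problem the model solves only some of the time.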

TCP: a Benchmark for Temporal Constraint-Based Planning

Abstract

arXiv:2505.19927v1 Announce Type: new Abstract: Temporal reasoning and planning are essential capabilities for large language models (LLMs), yet most existing benchmarks evaluate them in isolation and under limited forms of complexity. To address this gap, we introduce the Temporal Constraint-based Planning (TCP) benchmark, which jointly assesses both capabilities. Each instance in TCP features a naturalistic dialogue around a collaborative project, where diverse and interdependent temporal constraints are explicitly or implicitly expressed, and models must infer an optimal schedule that satisfies all constraints. To construct TCP, we first generate abstract problem prototypes that are paired with realistic scenarios from various domains and enriched into dialogues using an LLM. A human quality check is performed on a sampled subset to confirm the reliability of our benchmark. We evaluate state-of-the-art LLMs and find that even the strongest models struggle with TCP, highlighting its difficulty and revealing limitations in LLMs' temporal constraint-based planning abilities. We analyze underlying failure cases, open source our benchmark, and hope our findings can inspire future research.

摘要

时间推理与规划是大语言模型（LLMs）的核心能力，但现有基准测试大多孤立评估这两项能力且复杂度有限。为弥补这一不足，我们提出基于时间约束的规划（TCP）基准，该基准可联合评估上述双重能力。TCP每个实例围绕协作项目构建自然对话，其中显性或隐式包含多样且相互依赖的时间约束，模型必须推断出满足所有约束的最优时间表。TCP的构建首先生成抽象问题原型，将其与多领域现实场景配对，并利用LLM扩展为对话。通过对抽样子集的人工质检，我们验证了基准的可靠性。评估表明，即使最先进的LLMs在TCP上也表现不佳，凸显其难度并揭示LLMs在基于时间约束的规划能力上的局限。我们分析了典型错误案例，开源了基准测试集，期望研究成果能推动未来探索。

Large Language Models as Autonomous Spacecraft Operators in Kerbal Space Program

Abstract

arXiv:2505.19896v1 Announce Type: new Abstract: Recent trends are emerging in the use of Large Language Models (LLMs) as autonomous agents that take actions based on the content of the user text prompts. We intend to apply these concepts to the field of Control in space, enabling LLMs to play a significant role in the decision-making process for autonomous satellite operations. As a first step towards this goal, we have developed a pure LLM-based solution for the Kerbal Space Program Differential Games (KSPDG) challenge, a public software design competition where participants create autonomous agents for maneuvering satellites involved in non-cooperative space operations, running on the KSP game engine. Our approach leverages prompt engineering, few-shot prompting, and fine-tuning techniques to create an effective LLM-based agent that ranked 2nd in the competition. To the best of our knowledge, this work pioneers the integration of LLM agents into space research. The project comprises several open repositories to facilitate replication and further research. The codebase is accessible on GitHub (https://github.com/ARCLab-MIT/kspdg), while the trained models and datasets are available on Hugging Face (https://huggingface.co/OhhTuRnz). Additionally, experiment tracking and detailed results can be reviewed on Weights & Biases (https://wandb.ai/carrusk/huggingface).

摘要

当前出现了一种新趋势，即利用大型语言模型（LLMs）作为自主代理，根据用户文本提示的内容采取行动。我们计划将这些概念应用于空间控制领域，使LLMs在自主卫星操作的决策过程中发挥重要作用。作为实现该目标的第一步，我们为Kerbal太空计划差分博弈（KSPDG）挑战开发了一个纯基于LLM的解决方案。KSPDG是一项公开的软件设计竞赛，参赛者需创建自主代理，用于在KSP游戏引擎上操控参与非合作空间操作的卫星。我们的方法结合了提示工程、少样本提示和微调技术，开发出一个高效的基于LLM的代理，并在竞赛中获得第二名。据我们所知，这项工作首次将LLM代理集成到空间研究中。该项目包含多个开放仓库，以便于复现和进一步研究。代码库可在GitHub上获取，而训练好的模型和数据集则发布于Hugging Face。此外，实验跟踪和详细结果可在Weights & Biases上查看。

EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM

Abstract

arXiv:2505.19905v1 Announce Type: new Abstract: Although LLMs demonstrate proficiency in several text-based reasoning and planning tasks, their implementation in robotics control is constrained by significant deficiencies: (1) LLM agents are designed to work mainly with textual inputs rather than visual conditions; (2) Current multimodal agents treat LLMs as static planners, which separates their reasoning from environment dynamics, resulting in actions that do not take domain-specific knowledge into account; and (3) LLMs are not designed to learn from visual interactions, which makes it harder for them to make better policies for specific domains. In this paper, we introduce EMAC+, an Embodied Multimodal Agent that collaboratively integrates LLM and VLM via a bidirectional training paradigm. Unlike existing methods, EMAC+ dynamically refines high-level textual plans generated by an LLM using real-time feedback from a VLM executing low-level visual control tasks. We address critical limitations of previous models by enabling the LLM to internalize visual environment dynamics directly through interactive experience, rather than relying solely on static symbolic mappings. Extensive experimental evaluations on ALFWorld and RT-1 benchmarks demonstrate that EMAC+ achieves superior task performance, robustness against noisy observations, and efficient learning. We also conduct thorough ablation studies and provide detailed analyses of success and failure cases.

摘要

尽管大型语言模型（LLM）在多项基于文本的推理与规划任务中展现出卓越能力，但其在机器人控制领域的应用仍存在显著局限：（1）现有LLM智能体主要设计用于处理文本输入而非视觉条件；（2）当前多模态智能体将LLM视为静态规划器，使其推理过程与环境动态分离，导致动作决策缺乏领域特异性知识；（3）LLM不具备从视觉交互中学习的能力，难以针对特定领域优化策略。本文提出EMAC+——一种通过双向训练范式协同整合LLM与视觉语言模型（VLM）的具身多模态智能体。与现有方法不同，EMAC+利用执行底层视觉控制任务的VLM实时反馈，动态优化LLM生成的高级文本规划方案。我们通过让LLM直接内化交互体验中的视觉环境动态（而非依赖静态符号映射），解决了先前模型的关键缺陷。在ALFWorld和RT-1基准测试中的大量实验表明，EMAC+在任务性能、噪声观测鲁棒性及学习效率方面均表现优异。同时，我们开展了系统的消融研究，并对成功与失败案例进行了详细分析。

DCG-SQL: Enhancing In-Context Learning for Text-to-SQL with Deep Contextual Schema Link Graph

Abstract

arXiv:2505.19956v1 Announce Type: new Abstract: Text-to-SQL, which translates a natural language question into an SQL query, has advanced with in-context learning of Large Language Models (LLMs). However, existing methods show little improvement in performance compared to randomly chosen demonstrations, and significant performance drops when smaller LLMs (e.g., Llama 3.1-8B) are used. This indicates that these methods heavily rely on the intrinsic capabilities of hyper-scaled LLMs, rather than effectively retrieving useful demonstrations. In this paper, we propose a novel approach for effectively retrieving demonstrations and generating SQL queries. We construct a Deep Contextual Schema Link Graph, which contains key information and semantic relationship between a question and its database schema items. This graph-based structure enables effective representation of Text-to-SQL samples and retrieval of useful demonstrations for in-context learning. Experimental results on the Spider benchmark demonstrate the effectiveness of our approach, showing consistent improvements in SQL generation performance and efficiency across both hyper-scaled LLMs and small LLMs. Our code will be released.

摘要

文本到SQL（Text-to-SQL）任务旨在将自然语言问题转换为SQL查询，随着大型语言模型（LLMs）的上下文学习能力提升而取得进展。然而，现有方法相比随机选择的示例在性能上改进有限，且当使用较小规模的LLMs（如Llama 3.1-8B）时会出现显著性能下降。这表明这些方法过度依赖超大规模LLMs的固有能力，而非有效检索有用的示例。本文提出一种新颖的方法，用于高效检索示例并生成SQL查询。我们构建了一种深度上下文模式链接图（Deep Contextual Schema Link Graph），其中包含问题与其数据库模式项之间的关键信息和语义关系。这种基于图的结构能够有效表示Text-to-SQL样本，并为上下文学习检索有用的示例。在Spider基准测试上的实验结果表明，我们的方法在超大规模LLMs和小规模LLMs上均能持续提升SQL生成的性能和效率。代码将公开发布。

Subtle Risks, Critical Failures: A Framework for Diagnosing Physical Safety of LLMs for Embodied Decision Making

Abstract

arXiv:2505.19933v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly used for decision making in embodied agents, yet existing safety evaluations often rely on coarse success rates and domain-specific setups, making it difficult to diagnose why and where these models fail. This obscures our understanding of embodied safety and limits the selective deployment of LLMs in high-risk physical environments. We introduce SAFEL, a framework for systematically evaluating the physical safety of LLMs in embodied decision making. SAFEL assesses two key competencies: (1) rejecting unsafe commands via the Command Refusal Test, and (2) generating safe and executable plans via the Plan Safety Test. Critically, the latter is decomposed into functional modules (goal interpretation, transition modeling, and action sequencing), enabling fine-grained diagnosis of safety failures. To support this framework, we introduce EMBODYGUARD, a PDDL-grounded benchmark containing 942 LLM-generated scenarios covering both overtly malicious and contextually hazardous instructions. Evaluation across 13 state-of-the-art LLMs reveals that while models often reject clearly unsafe commands, they struggle to anticipate and mitigate subtle, situational risks. Our results highlight critical limitations in current LLMs and provide a foundation for more targeted, modular improvements in safe embodied reasoning.

摘要

大型语言模型（LLMs）在具身智能体的决策中应用日益广泛，然而现有安全评估通常依赖粗粒度的成功率指标和特定领域设置，难以诊断模型失败的具体原因及环节。这种模糊性阻碍了对具身安全性的深入理解，也限制了LLMs在高风险物理环境中的选择性部署。我们提出SAFEL框架，用于系统评估LLMs在具身决策中的物理安全性。SAFEL评估两大核心能力：(1)通过指令拒绝测试（Command Refusal Test）识别并拒绝不安全指令；(2)通过计划安全测试（Plan Safety Test）生成安全且可执行的方案。关键创新在于将后者分解为功能模块——目标解析、状态转移建模、动作序列生成，从而实现安全失效的细粒度归因。为支持该框架，我们构建了EMBODYGUARD基准测试，基于PDDL语言开发，包含942个LLM生成场景，涵盖显性恶意指令和情境性危险指令。对13个前沿LLMs的评估表明：虽然模型常能拒绝明显不安全的指令，但对潜在情境风险的预判与规避能力仍显不足。研究结果揭示了当前LLMs的关键局限，为具身推理安全性的模块化定向改进提供了理论基础。

ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

Abstract

arXiv:2505.19897v1 Announce Type: new Abstract: Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers' workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at https://qiushisun.github.io/ScienceBoard-Home/.

摘要

大语言模型（LLMs）的影响已超越自然语言处理领域，显著推动了跨学科研究的发展。近期，各类基于LLM的智能体被开发用于辅助科学发现进程，覆盖多领域多环节。其中，能够像人类一样与操作系统交互的计算机使用型智能体，正在为自动化解决科学问题及处理研究人员工作流程中的常规任务开辟道路。认识到这些智能体的变革潜力，我们推出ScienceBoard，其包含两项互补性贡献：（1）一个真实、多领域的动态可视化科学工作流环境，集成专业软件，使智能体能够通过不同界面自主交互，以加速复杂科研任务与实验；（2）一个由人类精心策划的169项高质量、严格验证的现实任务基准，涵盖生物化学、天文学、地理信息学等领域的科学发现工作流程。对采用最先进架构（如GPT-4o、Claude 3.7、UI-TARS等）的智能体进行的广泛评估表明，尽管取得部分积极成果，它们仍难以可靠辅助科学家完成复杂工作流，整体成功率仅为15%。深度分析进一步为解决当前智能体局限性及设计更有效原则提供了宝贵见解，为构建更具科学发现能力的智能体铺平道路。我们的代码、环境与基准详见https://qiushisun.github.io/ScienceBoard-Home/。

Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging

Abstract

arXiv:2505.19892v1 Announce Type: new Abstract: While foundation models update slowly due to resource-intensive training requirements, domain-specific models evolve between updates. Model merging aims to combine multiple expert models into a single, more capable model, thereby reducing storage and serving costs while supporting decentralized model development. Despite its potential, previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks. Multimodal Large Language Models (MLLMs), which extend the capabilities of LLMs through large-scale multimodal training, have gained traction. However, there is no benchmark for model merging research that clearly divides the tasks for MLLM training and evaluation. In this paper, (i) we introduce a model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, providing both LoRA and full fine-tuning models. Moreover, we explore how model merging can combine different modalities (e.g., vision-language, audio-language, and video-language models), moving toward the Omni-language model. (ii) We implement 10 model merging algorithms on the benchmark. Furthermore, we propose a novel method that removes noise from task vectors and robustly optimizes the merged vector based on a loss defined over task vector interactions, achieving an average performance gain of 2.48%. (iii) We find that model merging offers a promising way of building improved MLLMs without requiring training data. Our results also demonstrate that the complementarity among multiple modalities outperforms individual modalities.

摘要

由于资源密集的训练需求，基础模型更新缓慢，而领域专用模型在更新间隔期间持续演进。模型融合旨在将多个专家模型合并为单一更强能力的模型，从而降低存储与服务成本，同时支持去中心化的模型开发。尽管潜力巨大，先前研究主要集中于融合视觉分类模型或面向代码与数学任务的大语言模型（LLMs）。通过大规模多模态训练扩展LLM能力的多模态大语言模型（MLLMs）已受到广泛关注，但当前缺乏明确划分MLLM训练与评估任务的模型融合研究基准。本文中：（i）我们提出首个MLLM模型融合基准，涵盖视觉问答、几何、图表、光学字符识别和接地任务，并提供LoRA与全参数微调模型；进一步探索如何通过模型融合整合不同模态（如视觉-语言、音频-语言和视频-语言模型），向全能语言模型迈进。（ii）我们在基准上实现10种融合算法，并提出创新方法：通过消除任务向量噪声并基于任务向量交互定义的损失函数鲁棒优化合并向量，平均性能提升达2.48%。（iii）研究发现模型融合为构建更强MLLMs提供了无需数据训练的新途径，实验证实多模态间的互补性显著优于单一模态。

Adaptive Location Hierarchy Learning for Long-Tailed Mobility Prediction

Abstract

arXiv:2505.19965v1 Announce Type: new Abstract: Human mobility prediction, which aims to forecast users' next location visits based on historical trajectories, is crucial for applications ranging from location-based recommendations to urban planning. Despite the severe long-tailed distribution of locations, the problem of long-tailed mobility prediction remains largely underexplored. Existing long-tailed learning methods primarily focus on rebalancing the skewed distribution at the data, model, or class level, neglecting to exploit the spatiotemporal semantics of locations. To address this gap, we propose the first plug-and-play framework for long-tailed mobility prediction in an exploitation and exploration manner, named Adaptive LOcation HierArchy learning (ALOHA). First, we construct a city-tailored location hierarchy based on Large Language Models (LLMs) by exploiting Maslow's theory of human motivation to design Chain-of-Thought (CoT) prompts that capture spatiotemporal semantics. Second, we optimize the location hierarchy predictions by Gumbel disturbance and node-wise adaptive weights within the hierarchical tree structure. Experiments on state-of-the-art models across six datasets demonstrate the framework's consistent effectiveness and generalizability, striking a good balance between head and tail locations. Weight analysis and ablation studies reveal the optimization differences of each component for head and tail locations. Furthermore, in-depth analyses of hierarchical distance and a case study demonstrate the effective semantic guidance from the location hierarchy. Our code will be made publicly available.

摘要

人类移动预测对于从基于位置的推荐到城市规划等应用至关重要，其目标是根据历史轨迹预测用户的下一个访问位置。尽管位置数据存在严重的长尾分布，但长尾移动预测问题仍未得到充分探索。现有长尾学习方法主要关注在数据、模型或类别层面重新平衡偏态分布，而忽视了挖掘位置的时空语义。为填补这一空白，我们提出了首个即插即用的长尾移动预测框架ALOHA（自适应位置层次学习），采用开发与探索相结合的策略。首先，基于大语言模型构建城市定制化位置层次结构，利用马斯洛人类动机理论设计思维链提示以捕捉时空语义。其次，通过Gumbel扰动和层次树结构内节点自适应权重优化位置层级预测。在六个数据集上的最新模型实验表明，该框架具有持续有效性和泛化能力，在头部与尾部位置间实现了良好平衡。权重分析和消融研究揭示了各组件对头部与尾部位置的差异化优化效果。此外，层级距离的深入分析和案例研究验证了位置层次结构的有效语义引导作用。我们的代码将公开提供。

Curriculum-RLAIF: Curriculum Alignment with Reinforcement Learning from AI Feedback

Abstract

arXiv:2505.20075v1 Announce Type: new Abstract: Reward models trained with conventional Reinforcement Learning from AI Feedback (RLAIF) methods suffer from limited generalizability, which hinders the alignment performance of the policy model during reinforcement learning (RL). This challenge stems from various issues, including distribution shift, preference label noise, and mismatches between overly challenging samples and model capacity. In this paper, we attempt to enhance the generalizability of reward models through a data-centric approach, driven by the insight that these issues are inherently intertwined from the perspective of data difficulty. To address this, we propose a novel framework, Curriculum-RLAIF, which constructs preference pairs with varying difficulty levels and produces a curriculum that progressively incorporates preference pairs of increasing difficulty for reward model training. Our experimental results suggest that reward models trained with Curriculum-RLAIF achieve improved generalizability, increasing the alignment performance of the policy model by a large margin without incurring additional inference costs compared to various non-curriculum baselines. Detailed analysis and comparisons with alternative approaches, including data selection via external pretrained reward models or internal self-selection mechanisms, as well as other curriculum strategies, further demonstrate the superiority of our approach in terms of simplicity, efficiency, and effectiveness.

摘要

采用传统人工智能反馈强化学习（RLAIF）方法训练的奖励模型存在泛化能力有限的问题，这制约了强化学习（RL）过程中策略模型的对齐性能。该挑战源于多种因素，包括分布偏移、偏好标签噪声，以及高难度样本与模型能力之间的不匹配。本文尝试通过数据驱动的方法提升奖励模型的泛化能力，其核心洞见在于：从数据难度视角看，这些问题本质上是相互关联的。为此，我们提出新型框架Curriculum-RLAIF，该框架通过构建不同难度级别的偏好对，并设计渐进式融入递增难度偏好对的课程方案来训练奖励模型。实验结果表明，相较于多种非课程基线方法，采用Curriculum-RLAIF训练的奖励模型显著提升了泛化能力，在不增加额外推理成本的前提下大幅提高了策略模型的对齐性能。通过与外部预训练奖励模型的数据选择、内部自选择机制等替代方案以及其他课程策略的详细对比分析，进一步验证了本方法在简洁性、效率和有效性方面的优越性。
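
The idea of a curriculum that progressively incorporates harder preference pairs can be sketched as follows. This is only an illustration under stated assumptions: each pair is assumed to carry a precomputed `difficulty` score, and the cumulative equal-sized staging and the name `build_curriculum` are hypothetical; the abstract does not say how difficulty is measured or how stages are sized.

```python
import math


def build_curriculum(pairs, num_stages):
    """Return cumulative training stages ordered from easy to hard.

    pairs is a list of dicts with a "difficulty" key (an assumed field).
    Stage s contains the easiest s/num_stages fraction of the pairs, so
    later stages keep the easy pairs and add harder ones.
    """
    ordered = sorted(pairs, key=lambda pair: pair["difficulty"])
    stages = []
    for stage in range(1, num_stages + 1):
        cutoff = math.ceil(len(ordered) * stage / num_stages)
        stages.append(ordered[:cutoff])
    return stages


pairs = [
    {"id": "p1", "difficulty": 0.9},
    {"id": "p2", "difficulty": 0.1},
    {"id": "p3", "difficulty": 0.5},
    {"id": "p4", "difficulty": 0.3},
]
stages = build_curriculum(pairs, num_stages=2)
```

A reward model would then be trained on `stages[0]` first and on `stages[1]` afterwards, so it sees the hardest pairs only after the easier ones.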

Automatic Metadata Extraction for Text-to-SQL

Abstract

arXiv:2505.19988v1 Announce Type: new Abstract: Large Language Models (LLMs) have recently become sophisticated enough to automate many tasks ranging from pattern finding to writing assistance to code generation. In this paper, we examine text-to-SQL generation. We have observed from decades of experience that the most difficult part of query development lies in understanding the database contents. These experiences inform the direction of our research. Text-to-SQL benchmarks such as SPIDER and BIRD contain extensive metadata that is generally not available in practice. Human-generated metadata requires the use of expensive Subject Matter Experts (SMEs), who are often not fully aware of many aspects of their databases. In this paper, we explore techniques for automatic metadata extraction to enable text-to-SQL generation. We explore the use of two standard and one newer metadata extraction techniques: profiling, query log analysis, and SQL-to-text generation using an LLM. We use the BIRD benchmark [JHQY+23] to evaluate the effectiveness of these techniques. BIRD does not provide query logs on its test database, so we prepared a submission that uses profiling alone and does not use any specially tuned model (we used GPT-4o). From Sept 1 to Sept 23, 2024, and Nov 11 through Nov 23, 2024, we achieved the highest score both with and without using the "oracle" information provided with the question set. We regained the number 1 spot on Mar 11, 2025, and are still at #1 at the time of writing (May 2025).

摘要

大型语言模型（LLMs）近期已发展得足够成熟，能够自动化执行从模式发现、写作辅助到代码生成等多种任务。本文重点研究文本到SQL的生成。根据我们数十年的经验观察，查询开发中最困难的部分在于理解数据库内容。这些经验为我们的研究指明了方向。

诸如SPIDER和Bird等文本到SQL基准测试包含丰富的元数据，但这些元数据在实际应用中通常不可获取。人工生成的元数据需要依赖昂贵领域专家（SMEs），而这些专家往往对其数据库的许多方面并不完全了解。本文探索了自动元数据提取技术以实现文本到SQL生成。

我们研究了两种标准技术和一种新型元数据提取方法的应用：数据画像分析、查询日志分析以及使用LLM进行SQL到文本的生成。采用BIRD基准测试[JHQY+23]评估这些技术的有效性。由于BIRD未提供测试数据库的查询日志，我们提交的方案仅使用数据画像分析，且未采用任何特别调优的模型（使用GPT-4o）。在2024年9月1日至23日及11月11日至23日期间，无论是否使用问题集提供的"oracle"信息，我们都获得了最高分数。我们于2025年3月11日重夺榜首位置，并在本文撰写时（2025年5月）仍保持第一。

Safety Through Reasoning: An Empirical Study of Reasoning Guardrail Models

Abstract

arXiv:2505.20087v1 Announce Type: new Abstract: Reasoning-based language models have demonstrated strong performance across various domains, with the most notable gains seen in mathematical and coding tasks. Recent research has shown that reasoning also offers significant benefits for LLM safety and guardrail applications. In this work, we conduct a comprehensive analysis of training reasoning-based guardrail models for content moderation, with an emphasis on generalization to custom safety policies at inference time. Our study focuses on two key dimensions: data efficiency and inference efficiency. On the data front, we find that reasoning-based models exhibit strong sample efficiency, achieving competitive performance with significantly fewer training examples than their non-reasoning counterparts. This unlocks the potential to repurpose the remaining data for mining high-value, difficult samples that further enhance model performance. On the inference side, we evaluate practical trade-offs by introducing reasoning budgets, examining the impact of reasoning length on latency and accuracy, and exploring dual-mode training to allow runtime control over reasoning behavior. Our findings will provide practical insights for researchers and developers to effectively and efficiently train and deploy reasoning-based guardrail models in real-world systems.

摘要

基于推理的语言模型在多个领域展现出卓越性能，其中数学和编程任务的提升尤为显著。近期研究表明，推理机制对大型语言模型的安全防护应用同样具有重要价值。本研究对基于推理的内容审核防护模型进行了全面分析，重点探讨其在推理阶段对自定义安全策略的泛化能力。我们从两个关键维度展开研究：数据效率与推理效率。在数据层面，发现基于推理的模型具有显著的样本效率，仅需远少于非推理模型的训练样本即可达到相当性能，这使得剩余数据可被重新用于挖掘高价值难样本以进一步提升模型表现。在推理层面，我们通过引入推理预算评估实际权衡，考察推理长度对延迟和准确率的影响，并探索双模式训练以实现运行时对推理行为的动态调控。本研究将为开发者在实际系统中高效训练和部署基于推理的防护模型提供实用指导。

Agentic AI Process Observability: Discovering Behavioral Variability

Abstract

arXiv:2505.20127v1 Announce Type: new Abstract: AI agents that leverage Large Language Models (LLMs) are increasingly becoming core building blocks of modern software systems. A wide range of frameworks is now available to support the specification of such applications. These frameworks enable the definition of agent setups using natural language prompting, which specifies the roles, goals, and tools assigned to the various agents involved. Within such setups, agent behavior is non-deterministic for any given input, highlighting the critical need for robust debugging and observability tools. In this work, we explore the use of process and causal discovery applied to agent execution trajectories as a means of enhancing developer observability. This approach aids in monitoring and understanding the emergent variability in agent behavior. Additionally, we complement this with LLM-based static analysis techniques to distinguish between intended and unintended behavioral variability. We argue that such instrumentation is essential for giving developers greater control over evolving specifications and for identifying aspects of functionality that may require more precise and explicit definitions.

摘要

基于大语言模型（LLMs）的人工智能代理正日益成为现代软件系统的核心构建模块。目前已有多种框架支持此类应用的规范定义，这些框架通过自然语言提示实现代理配置，明确指定各代理的角色、目标及分配工具。在此类配置中，代理行为对于任何给定输入均呈现非确定性特征，这凸显出强大调试与可观测性工具的关键需求。本研究探索将过程发现与因果发现技术应用于代理执行轨迹，以此增强开发者可观测性。该方法有助于监测和理解代理行为中涌现的变异性。此外，我们结合基于LLM的静态分析技术，以区分预期与非预期的行为变异。我们认为，此类工具对于提升开发者对演进规范的控制力，以及识别需要更精确明确定义的功能维度具有重要作用。

Capability-Based Scaling Laws for LLM Red-Teaming

Abstract

arXiv:2505.20162v1 Announce Type: new Abstract: As large language models grow in capability and agency, identifying vulnerabilities through red-teaming becomes vital for safe deployment. However, traditional prompt-engineering approaches may prove ineffective once red-teaming turns into a weak-to-strong problem, where target models surpass red-teamers in capabilities. To study this shift, we frame red-teaming through the lens of the capability gap between attacker and target. We evaluate more than 500 attacker-target pairs using LLM-based jailbreak attacks that mimic human red-teamers across diverse families, sizes, and capability levels. Three strong trends emerge: (i) more capable models are better attackers, (ii) attack success drops sharply once the target's capability exceeds the attacker's, and (iii) attack success rates correlate with high performance on social science splits of the MMLU-Pro benchmark. From these trends, we derive a jailbreaking scaling law that predicts attack success for a fixed target based on attacker-target capability gap. These findings suggest that fixed-capability attackers (e.g., humans) may become ineffective against future models, increasingly capable open-source models amplify risks for existing systems, and model providers must accurately measure and control models' persuasive and manipulative abilities to limit their effectiveness as attackers.

摘要

随着大语言模型能力和自主性的提升，通过红队测试识别漏洞对安全部署变得至关重要。然而，当红队测试转变为强弱对抗问题（即目标模型能力超越红队测试者）时，传统的提示工程方法可能失效。为研究这一转变，我们从攻击者与目标之间的能力差距视角重新审视红队测试框架。通过基于LLM的越狱攻击模拟人类红队测试者，我们评估了涵盖不同模型家族、规模和能力水平的500多个攻击者-目标组合，发现三个显著趋势：(i) 能力更强的模型具备更优攻击性；(ii) 当目标能力超越攻击者时，攻击成功率急剧下降；(iii) 攻击成功率与MMLU-Pro基准测试中社会科学板块的高表现呈正相关。基于这些趋势，我们推导出越狱攻击的缩放定律，可根据攻击者-目标能力差距预测固定目标的攻击成功率。这些发现表明：固定能力攻击者（如人类）可能对未来模型失效；日益强大的开源模型会放大现有系统风险；模型提供商必须精确测量并控制模型的劝说与操控能力，以限制其作为攻击者的有效性。

An Empirical Study on Strong-Weak Model Collaboration for Repo-level Code Generation

Abstract

arXiv:2505.20182v1 Announce Type: new Abstract: We study cost-efficient collaboration between strong and weak language models for repository-level code generation, where the weak model handles simpler tasks at lower cost, and the most challenging tasks are delegated to the strong model. While many works propose architectures for this task, few analyze performance relative to cost. We evaluate a broad spectrum of collaboration strategies: context-based, pipeline-based, and dynamic, on GitHub issue resolution. Our most effective collaborative strategy achieves equivalent performance to the strong model while reducing the cost by 40%. Based on our findings, we offer actionable guidelines for choosing collaboration strategies under varying budget and performance constraints. Our results show that strong-weak collaboration substantially boosts the weak model's performance at a fraction of the cost, pipeline and context-based methods being most efficient. We release the code for our work at https://github.com/shubhamrgandhi/codegen-strong-weak-collab.

摘要

我们研究了强弱语言模型在仓库级代码生成中的成本效益协作机制，其中弱模型以较低成本处理简单任务，而最具挑战性的任务则委托给强模型。尽管已有许多研究提出该任务的架构方案，但鲜有工作系统分析性能与成本的关系。我们在GitHub问题解决场景下评估了多种协作策略：基于上下文的、基于管道的以及动态策略。实验表明，最优协作策略在保持与强模型同等性能的同时可降低40%成本。基于研究发现，我们提出了在不同预算和性能约束下选择协作策略的实用指南。结果表明强弱协作能以极小成本显著提升弱模型性能，其中管道式和基于上下文的方法效率最高。本研究代码已发布于https://github.com/shubhamrgandhi/codegen-strong-weak-collab。

Program of Equations Thoughts to Solve Algebra Word Problems

Abstract

arXiv:2505.20170v1 Announce Type: new Abstract: Solving algebraic word problems (AWPs) has recently emerged as an important natural language processing task. Recently, large language models (LLMs) have demonstrated powerful mathematical capabilities, and the Chain-of-Thought technique, which guides LLMs through step-by-step reasoning, has yielded impressive results. However, this reasoning ability is limited by the computational weaknesses of LLMs themselves, where calculation errors can accumulate, leading to incorrect final answers. To address this, we propose Program of Equations Thoughts (POET), which transforms the task of generating step-by-step reasoning answers into a two-stage task of predicting equations and generating code, offloading complex computations to a Python interpreter to avoid calculation errors in LLMs. Furthermore, we propose Zero-shot POET, which utilizes a manually designed template to enable LLMs to directly generate Python code for one-step solving. Our method achieves accuracies of 95.3% and 98.0% on the PEN and ALG514 datasets, respectively, setting a new state-of-the-art (SOTA). Zero-shot POET also achieves the SOTA result of 95.5% on the DRAW-1K dataset.

摘要

解决代数应用题（AWP）近年来已成为自然语言处理领域的重要任务。当前，大语言模型（LLM）展现出强大的数学能力，而引导模型逐步推理的思维链技术已取得显著成果。然而，这种推理能力受限于LLM自身的计算缺陷——计算误差会逐步累积并导致最终答案错误。为此，我们提出方程程序思维（POET）方法，将生成逐步推理答案的任务转化为预测方程与生成代码的两阶段任务，将复杂计算卸载至Python解释器以避免LLM的计算错误。此外，我们提出零样本POET，通过人工设计模板使LLM能直接生成一步求解的Python代码。本方法在PEN和ALG514数据集上分别达到95.3%和98.0%的准确率，创造了最新最优（SOTA）结果。零样本POET在DRAW-1K数据集上也实现了95.5%的SOTA性能。
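
The two-stage idea (predict equations, then generate code that an interpreter executes) can be illustrated with a small sketch. Everything in it is assumed for illustration: the word problem, the `predicted_equations` tuples, and the `generated_code` string stand in for LLM outputs, and the paper's actual prompts and code templates are not reproduced here.

```python
from fractions import Fraction

# Stage 1 (assumed LLM output): equations for the word problem
# "Two numbers sum to 30 and differ by 4", each written as a*x + b*y = c.
predicted_equations = [(1, 1, 30), (1, -1, 4)]

# Stage 2 (assumed LLM output): code that solves the predicted equations
# with Cramer's rule, using exact rational arithmetic.
generated_code = """
def solve(equations):
    (a1, b1, c1), (a2, b2, c2) = equations
    det = Fraction(a1 * b2 - a2 * b1)
    if det == 0:
        raise ValueError("no unique solution")
    x = (c1 * b2 - c2 * b1) / det
    y = (a1 * c2 - a2 * c1) / det
    return x, y
"""

# The interpreter, not the language model, performs the arithmetic.
# Note: exec on untrusted model output should run in a sandbox in practice.
namespace = {"Fraction": Fraction}
exec(generated_code, namespace)
x, y = namespace["solve"](predicted_equations)
print(x, y)
```

Because the numbers are computed by the interpreter, an arithmetic slip in the model's reasoning text cannot propagate to the final answer; only errors in the predicted equations or the generated code can.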

Temporal Sampling for Forgotten Reasoning in LLMs

Abstract

arXiv:2505.20196v1 Announce Type: new Abstract: Fine-tuning large language models (LLMs) is intended to improve their reasoning capabilities, yet we uncover a counterintuitive effect: models often forget how to solve problems they previously answered correctly during training. We term this phenomenon temporal forgetting and show that it is widespread across model sizes, fine-tuning methods (both Reinforcement Learning and Supervised Fine-Tuning), and multiple reasoning benchmarks. To address this gap, we introduce Temporal Sampling, a simple decoding strategy that draws outputs from multiple checkpoints along the training trajectory. This approach recovers forgotten solutions without retraining or ensembling, and leads to substantial improvements in reasoning performance, with gains of 4 to 19 points in Pass@k and consistent gains in Majority@k across several benchmarks. We further extend our method to LoRA-adapted models, demonstrating that storing only adapter weights across checkpoints achieves similar benefits with minimal storage cost. By leveraging the temporal diversity inherent in training, Temporal Sampling offers a practical, compute-efficient way to surface hidden reasoning ability and rethink how we evaluate LLMs.

摘要

微调大语言模型（LLMs）旨在提升其推理能力，但我们发现了一个反直觉的现象：模型往往会遗忘训练过程中曾正确解决的问题。我们将这种现象称为"时序遗忘"，并证明其普遍存在于不同模型规模、微调方法（包括强化学习和监督微调）以及多个推理基准测试中。为应对这一问题，我们提出"时序采样"——一种简单的解码策略，通过从训练轨迹中的多个检查点抽取输出来重建被遗忘的解决方案。该方法无需重新训练或集成模型，即可显著提升推理性能：在Pass@k指标上获得4至19分的提升，并在多个基准测试的Majority@k中实现持续增益。我们进一步将该方法拓展至LoRA适配模型，证明仅存储检查点间的适配器权重即可获得相似效益，且存储成本极低。通过利用训练过程中固有的时序多样性，时序采样提供了一种实用且计算高效的方法来挖掘隐藏的推理能力，并促使我们重新思考如何评估大语言模型。
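
A minimal sketch of drawing samples from several checkpoints instead of only the final one is given below. The round-robin allocation starting from the latest checkpoint is an assumption made for illustration, and the three toy checkpoints are stubs that return fixed answers rather than real model calls.

```python
from collections import Counter
from typing import Callable, List, Sequence

Checkpoint = Callable[[str], str]  # maps a question to one sampled answer


def temporal_sample(checkpoints: Sequence[Checkpoint], question: str, k: int) -> List[str]:
    """Draw k answers, cycling over checkpoints starting from the latest one."""
    ordered = list(reversed(checkpoints))  # latest checkpoint first
    return [ordered[i % len(ordered)](question) for i in range(k)]


def pass_at_k(answers: List[str], gold: str) -> bool:
    """True if any of the sampled answers is correct."""
    return gold in answers


def majority_at_k(answers: List[str], gold: str) -> bool:
    """True if the most frequent sampled answer is correct."""
    return Counter(answers).most_common(1)[0][0] == gold


# Toy trajectory: the earlier checkpoints solve the problem, the final one forgot it.
early = lambda q: "42"
middle = lambda q: "42"
final = lambda q: "41"

question, gold = "What is 6 * 7?", "42"
answers = temporal_sample([early, middle, final], question, k=3)
final_only = temporal_sample([final], question, k=3)

print(answers, pass_at_k(answers, gold), majority_at_k(answers, gold))
print(final_only, pass_at_k(final_only, gold))
```

In this toy case, sampling only from the final checkpoint never recovers the correct answer, while spreading the same budget of three samples over the trajectory does.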

Simulating Macroeconomic Expectations using LLM Agents

Abstract

arXiv:2505.17648v1 Announce Type: cross Abstract: We introduce a novel framework for simulating macroeconomic expectation formation using Large Language Model-Empowered Agents (LLM Agents). By constructing thousands of LLM Agents equipped with modules for personal characteristics, prior expectations, and knowledge, we replicate a survey experiment involving households and experts on inflation and unemployment. Our results show that although the expectations and thoughts generated by LLM Agents are more homogeneous than those of human participants, they still effectively capture key heterogeneity across agents and the underlying drivers of expectation formation. Furthermore, a module-ablation exercise highlights the critical role of prior expectations in simulating such heterogeneity. This approach complements traditional survey methods and offers new insights into AI behavioral science in macroeconomic research.

摘要

我们提出了一种利用大语言模型赋能智能体（LLM Agents）模拟宏观经济预期形成的新框架。通过构建数千个配备个人特征模块、先验预期模块和知识模块的LLM智能体，我们复现了针对家庭和专家关于通胀与失业预期的调查实验。研究结果表明：尽管LLM智能体生成的预期和观点比人类参与者更具同质性，但仍能有效捕捉不同智能体间的关键异质性以及预期形成的深层驱动因素。进一步的模块消融实验突显了先验预期在模拟此类异质性中的核心作用。该方法不仅是对传统调查手段的重要补充，更为宏观经济研究中的人工智能行为科学提供了新的研究视角。

InjectLab: A Tactical Framework for Adversarial Threat Modeling Against Large Language Models

Abstract

arXiv:2505.18156v1 Announce Type: cross Abstract: Large Language Models (LLMs) are changing the way people interact with technology. Tools like ChatGPT and Claude AI are now common in business, research, and everyday life. But with that growth comes new risks, especially prompt-based attacks that exploit how these models process language. InjectLab is a security framework designed to address that problem. This paper introduces InjectLab as a structured, open-source matrix that maps real-world techniques used to manipulate LLMs. The framework is inspired by MITRE ATT&CK and focuses specifically on adversarial behavior at the prompt layer. It includes over 25 techniques organized under six core tactics, covering threats like instruction override, identity swapping, and multi-agent exploitation. Each technique in InjectLab includes detection guidance, mitigation strategies, and YAML-based simulation tests. A Python tool supports easy execution of prompt-based test cases. This paper outlines the framework's structure, compares it to other AI threat taxonomies, and discusses its future direction as a practical, community-driven foundation for securing language models.

摘要

大型语言模型（LLMs）正在改变人们与技术交互的方式。诸如ChatGPT和Claude AI等工具现已广泛应用于商业、研究和日常生活。然而，随着其发展，新的风险也随之而来，尤其是利用这些模型语言处理机制的提示型攻击。InjectLab是一个旨在解决该问题的安全框架。本文介绍InjectLab作为一种结构化、开源矩阵，用于映射现实世界中操纵LLMs的技术。该框架受MITRE ATT&CK启发，特别关注提示层的对抗行为，包含六大核心策略下的25种以上技术，涵盖指令覆盖、身份切换和多智能体利用等威胁。InjectLab中的每种技术均包含检测指南、缓解策略及基于YAML的模拟测试，并配备Python工具以支持便捷执行提示型测试用例。本文概述了该框架的结构，将其与其他AI威胁分类法进行比较，并探讨其作为保护语言模型的实践性、社区驱动基础框架的未来发展方向。

On Path to Multimodal Historical Reasoning: HistBench and HistAgent

Abstract

arXiv:2505.20246v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have led to remarkable progress across domains, yet their capabilities in the humanities, particularly history, remain underexplored. Historical reasoning poses unique challenges for AI, involving multimodal source interpretation, temporal inference, and cross-linguistic analysis. While general-purpose agents perform well on many existing benchmarks, they lack the domain-specific expertise required to engage with historical materials and questions. To address this gap, we introduce HistBench, a new benchmark of 414 high-quality questions designed to evaluate AI's capacity for historical reasoning and authored by more than 40 expert contributors. The tasks span a wide range of historical problems, from factual retrieval based on primary sources, to interpretive analysis of manuscripts and images, to interdisciplinary challenges involving archaeology, linguistics, or cultural history. Furthermore, the benchmark dataset spans 29 ancient and modern languages and covers a wide range of historical periods and world regions. Having found that LLMs and other agents perform poorly on HistBench, we further present HistAgent, a history-specific agent equipped with carefully designed tools for OCR, translation, archival search, and image understanding in history. On HistBench, HistAgent based on GPT-4o achieves an accuracy of 27.54% pass@1 and 36.47% pass@2, significantly outperforming LLMs with online search and generalist agents, including GPT-4o (18.60%), DeepSeek-R1 (14.49%) and Open Deep Research-smolagents (20.29% pass@1 and 25.12% pass@2). These results highlight the limitations of existing LLMs and generalist agents and demonstrate the advantages of HistAgent for historical reasoning.

摘要

尽管大语言模型（LLMs）的最新进展在各领域取得了显著成就，但其在人文学科尤其是历史学中的能力仍待深入探索。历史推理对人工智能提出了独特挑战，涉及多模态史料解读、时序推理及跨语言分析。虽然通用智能体在现有基准测试中表现良好，但它们缺乏处理历史材料和问题所需的领域专业知识。为填补这一空白，我们推出了HistBench——一个包含414道高质量问题的全新基准测试，由40余位专家共同设计，旨在评估AI的历史推理能力。这些任务涵盖广泛的历史问题，包括基于原始史实检索、手稿与图像的诠释分析，以及涉及考古学、语言学或文化史等跨学科挑战。此外，该基准数据集涵盖29种古今语言，跨越多个历史时期和世界区域。针对LLMs及其他智能体在HistBench上的较差表现，我们进一步提出HistAgent——一个专为历史研究设计的智能体，配备精心构建的OCR、翻译、档案检索和图像理解工具。基于GPT-4o的HistAgent在HistBench上取得了27.54%的pass@1准确率和36.47%的pass@2准确率，显著优于具备在线搜索功能的LLMs及通用智能体（包括GPT-4o的18.60%、DeepSeek-R1的14.49%以及Open Deep Research-smolagents的20.29% pass@1和25.12% pass@2）。这些结果既揭示了现有LLMs与通用智能体的局限性，也验证了HistAgent在历史推理中的优势。

Model-Distributed Inference for Large Language Models at the Edge

Abstract

arXiv:2505.18164v1 Announce Type: cross Abstract: We introduce Model-Distributed Inference for Large-Language Models (MDI-LLM), a novel framework designed to facilitate the deployment of state-of-the-art large-language models (LLMs) across low-power devices at the edge. This is accomplished by dividing the model into multiple partitions, which are then assigned to different devices/nodes within the network. These nodes exchange intermediate activation vectors via device-to-device links, enabling collaborative computation. To enhance the efficiency of this process, we propose the "recurrent pipeline parallelism" technique, which reduces idle time on each device and facilitates parallel inference during the generation of multiple text sequences. By leveraging the combined computational resources of multiple edge devices, MDI-LLM enables the deployment of LLMs that exceed the memory capacity of individual devices, making it possible to perform inference on low-cost hardware. Furthermore, as the number of participating devices increases, MDI-LLM boosts token generation throughput and reduces memory consumption per device.

摘要

我们提出了一种面向大语言模型的模型分布式推理框架（MDI-LLM），该创新框架旨在促进最先进的大语言模型在边缘低功耗设备上的部署。该框架通过将模型划分为多个分区，并将其分配到网络中的不同设备/节点来实现。这些节点通过设备间链路交换中间激活向量，从而实现协同计算。为提高该过程的效率，我们提出了"循环流水线并行"技术，该技术可减少每个设备的空闲时间，并在生成多个文本序列时实现并行推理。通过利用多个边缘设备的组合计算资源，MDI-LLM能够部署超出单个设备内存容量的大语言模型，使得在低成本硬件上执行推理成为可能。此外，随着参与设备数量的增加，MDI-LLM可提升令牌生成吞吐量并降低每个设备的内存消耗。
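
The partitioning step described above can be sketched as follows. This is an illustration under stated assumptions, not MDI-LLM itself: layers are modelled as plain Python functions, the near-equal contiguous split in `partition` is a hypothetical policy, and the loop in `distributed_forward` stands in for device-to-device transfers. The "recurrent pipeline parallelism" scheduling is not modelled.

```python
from typing import Callable, List

Layer = Callable[[List[float]], List[float]]


def partition(layers: List[Layer], num_nodes: int) -> List[List[Layer]]:
    """Split layers into num_nodes contiguous chunks of near-equal size."""
    base, extra = divmod(len(layers), num_nodes)
    chunks, start = [], 0
    for node in range(num_nodes):
        size = base + (1 if node < extra else 0)
        chunks.append(layers[start:start + size])
        start += size
    return chunks


def run_node(chunk: List[Layer], activations: List[float]) -> List[float]:
    """Run one node's share of the model on the incoming activations."""
    for layer in chunk:
        activations = layer(activations)
    return activations


def distributed_forward(chunks: List[List[Layer]], x: List[float]) -> List[float]:
    """Pass intermediate activations from node to node in order."""
    for chunk in chunks:  # each iteration stands for one device-to-device hop
        x = run_node(chunk, x)
    return x


def make_layer(offset: float) -> Layer:
    """Toy layer that adds a constant to every activation."""
    return lambda acts: [a + offset for a in acts]


layers = [make_layer(float(i)) for i in range(1, 6)]  # five toy layers
chunks = partition(layers, num_nodes=2)
out = distributed_forward(chunks, [0.0, 1.0])
print([len(c) for c in chunks], out)
```

Each node only needs to hold its own chunk in memory, which is why the per-device memory footprint shrinks as more nodes participate.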

syftr: Pareto-Optimal Generative AI

Abstract

arXiv:2505.20266v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) pipelines are central to applying large language models (LLMs) to proprietary or dynamic data. However, building effective RAG flows is complex, requiring careful selection among vector databases, embedding models, text splitters, retrievers, and synthesizing LLMs. The challenge deepens with the rise of agentic paradigms. Modules like verifiers, rewriters, and rerankers, each with intricate hyperparameter dependencies, have to be carefully tuned. Balancing tradeoffs between latency, accuracy, and cost becomes increasingly difficult in performance-sensitive applications. We introduce syftr, a framework that performs efficient multi-objective search over a broad space of agentic and non-agentic RAG configurations. Using Bayesian Optimization, syftr discovers Pareto-optimal flows that jointly optimize task accuracy and cost. A novel early-stopping mechanism further improves efficiency by pruning clearly suboptimal candidates. Across multiple RAG benchmarks, syftr finds flows which are on average approximately 9 times cheaper while preserving most of the accuracy of the most accurate flows on the Pareto-frontier. Furthermore, syftr's ability to design and optimize allows integrating new modules, making it even easier and faster to realize high-performing generative AI pipelines.

摘要

检索增强生成（RAG）流程是将大语言模型（LLMs）应用于专有或动态数据的核心。然而，构建高效的RAG流程十分复杂，需要在向量数据库、嵌入模型、文本分割器、检索器和合成LLMs之间进行谨慎选择。随着代理范式的兴起，这一挑战进一步加深。验证器、改写器和重排序器等模块——每个模块都具有复杂的超参数依赖关系——必须仔细调优。在性能敏感的应用中，平衡延迟、准确性和成本之间的权衡变得越来越困难。我们提出了syftr框架，该框架能在广泛的代理和非代理RAG配置空间中进行高效的多目标搜索。通过贝叶斯优化，syftr发现了能同时优化任务准确性和成本的帕累托最优流程。一种新颖的早期停止机制通过剪枝明显次优的候选方案，进一步提高了效率。在多个RAG基准测试中，syftr发现的流程平均比帕累托前沿上最准确的流程便宜约9倍，同时保留了其大部分准确性。此外，syftr的设计和优化能力允许集成新模块，使得实现高性能生成式AI流程更加便捷和快速。
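
What "Pareto-optimal" means for accuracy and cost can be shown with a short sketch. It only filters a fixed list of candidates; the Bayesian Optimization search and early-stopping mechanism of syftr are not modelled, and the flow names and numbers below are made up for illustration.

```python
from typing import List, NamedTuple


class Flow(NamedTuple):
    name: str
    accuracy: float  # higher is better
    cost: float      # lower is better


def dominates(a: Flow, b: Flow) -> bool:
    """a dominates b if it is no worse on both objectives and better on one."""
    return (a.accuracy >= b.accuracy and a.cost <= b.cost
            and (a.accuracy > b.accuracy or a.cost < b.cost))


def pareto_front(flows: List[Flow]) -> List[Flow]:
    """Keep only flows that no other flow dominates, sorted by cost."""
    front = [f for f in flows if not any(dominates(g, f) for g in flows)]
    return sorted(front, key=lambda f: f.cost)


# Hypothetical candidate RAG flows (accuracy, cost per query in arbitrary units).
candidates = [
    Flow("agentic-large", 0.90, 9.0),
    Flow("rerank-small", 0.85, 1.0),
    Flow("rewrite-small", 0.80, 2.0),  # dominated by rerank-small
    Flow("no-retrieval", 0.60, 0.2),
    Flow("agentic-xl", 0.88, 9.5),     # dominated by agentic-large
]

front = pareto_front(candidates)
for flow in front:
    print(flow)
```

In this toy set, `rerank-small` keeps most of the accuracy of `agentic-large` at one ninth of the cost, which is the kind of tradeoff point a Pareto frontier is meant to expose.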

Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution

Abstract

arXiv:2505.20286v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have enabled agents to autonomously perform complex, open-ended tasks. However, many existing frameworks depend heavily on manually predefined tools and workflows, which hinder their adaptability, scalability, and generalization across domains. In this work, we introduce Alita, a generalist agent designed with the principle of "Simplicity is the ultimate sophistication," enabling scalable agentic reasoning through minimal predefinition and maximal self-evolution. For minimal predefinition, Alita is equipped with only one component for direct problem-solving, making it much simpler and neater than previous approaches that relied heavily on hand-crafted, elaborate tools and workflows. This clean design enhances its potential to generalize to challenging questions, without being limited by tools. For maximal self-evolution, we enable the creativity of Alita by providing a suite of general-purpose components to autonomously construct, refine, and reuse external capabilities by generating task-related model context protocols (MCPs) from open source, which contributes to scalable agentic reasoning. Notably, Alita achieves 75.15% pass@1 and 87.27% pass@3 accuracy, which is top-ranking among general-purpose agents, on the GAIA benchmark validation dataset, and 74.00% and 52.00% pass@1, respectively, on Mathvista and PathVQA, outperforming many agent systems with far greater complexity. More details will be updated at https://github.com/CharlesQ9/Alita.

摘要

大语言模型（LLM）的最新进展使得智能体能够自主执行复杂的开放式任务。然而，现有框架大多严重依赖手动预定义工具与工作流，这限制了其跨领域的适应性、可扩展性和泛化能力。本研究提出Alita——一个遵循"至简即至臻"原则设计的通用智能体，通过最小化预定义与最大化自我进化实现可扩展的自主推理。在最小化预定义方面，Alita仅配备单一直接问题解决组件，相比依赖大量手工构建复杂工具链的现有方案更为简洁。这种简洁设计增强了其应对挑战性问题的泛化潜力，不受工具限制。在最大化自我进化方面，我们通过开源模型上下文协议（MCP）生成机制，提供通用组件套件使智能体能自主构建、优化和复用外部能力，从而激发创造力并实现可扩展的自主推理。值得注意的是，Alita在GAIA基准验证集上达到75.15% pass@1和87.27% pass@3准确率，位列通用智能体榜首；在Mathvista和PathVQA上分别取得74.00%和52.00% pass@1，性能超越许多复杂度更高的智能体系统。更多细节将持续更新于https://github.com/CharlesQ9/Alita。

LA-RCS: LLM-Agent-Based Robot Control System

Abstract

arXiv:2505.18214v1 Announce Type: cross Abstract: LA-RCS (LLM-agent-based robot control system) is a sophisticated robot control system designed to autonomously plan, work, and analyze the external environment based on user requirements by utilizing LLM-Agent. Utilizing a dual-agent framework, LA-RCS generates plans based on user requests, observes the external environment, executes the plans, and modifies the plans as needed to adapt to changes in the external conditions. Additionally, LA-RCS interprets natural language commands from the user and converts them into commands compatible with the robot interface so that the robot can execute tasks and meet user requests properly. During this process, the system autonomously evaluates observation results, provides feedback on the tasks, and executes commands based on real-time environmental monitoring, significantly reducing the need for user intervention in fulfilling requests. We categorized the scenarios that LA-RCS needs to perform into four distinct types and conducted a quantitative assessment of its performance in each scenario. The results showed an average success rate of 90 percent, demonstrating the system's capability to fulfill user requests satisfactorily. For more extensive results, readers can visit our project page: https://la-rcs.github.io

Towards medical AI misalignment: a preliminary study

Abstract

arXiv:2505.18212v1 Announce Type: cross Abstract: Despite their staggering capabilities as assistant tools, often exceeding human performance, Large Language Models (LLMs) are still prone to jailbreak attempts from malevolent users. Although red teaming practices have already identified and helped to address several such jailbreak techniques, one particularly sturdy approach involving role-playing (which we named "Goofy Game") seems effective against most current LLM safeguards. This can result in the provision of unsafe content, which, although not harmful per se, might lead to dangerous consequences if delivered in a setting such as the medical domain. In this preliminary and exploratory study, we provide an initial analysis of how, even without technical knowledge of the internal architecture and parameters of generative AI models, a malicious user could construct a role-playing prompt capable of coercing an LLM into producing incorrect (and potentially harmful) clinical suggestions. We aim to illustrate a specific vulnerability scenario, providing insights that can support future advancements in the field.

摘要

尽管大型语言模型（LLM）作为辅助工具展现出惊人能力且常超越人类表现，但其仍易受到恶意用户的越狱攻击。虽然红队测试已识别并协助修复了多种此类越狱技术，但一种名为"滑稽游戏"的角色扮演方法表现出特殊鲁棒性，能有效突破当前大多数LLM防护机制。这可能导致模型输出不安全内容——尽管内容本身无害，但若应用于医疗等领域则可能引发严重后果。在本探索性初步研究中，我们首次分析了恶意用户如何在无需了解生成式AI模型内部架构与技术参数的情况下，通过构建角色扮演提示词迫使LLM生成错误（且具潜在危害性）的临床建议。本研究旨在揭示特定漏洞场景，为未来该领域的安全防护研究提供理论依据。

ABHINAYA -- A System for Speech Emotion Recognition In Naturalistic Conditions Challenge

Abstract

arXiv:2505.18217v1 Announce Type: cross Abstract: Speech emotion recognition (SER) in naturalistic settings remains a challenge due to the intrinsic variability, diverse recording conditions, and class imbalance. As participants in the Interspeech Naturalistic SER Challenge which focused on these complexities, we present Abhinaya, a system integrating speech-based, text-based, and speech-text models. Our approach fine-tunes self-supervised and speech large language models (SLLM) for speech representations, leverages large language models (LLM) for textual context, and employs speech-text modeling with an SLLM to capture nuanced emotional cues. To combat class imbalance, we apply tailored loss functions and generate categorical decisions through majority voting. Despite one model not being fully trained, the Abhinaya system ranked 4th among 166 submissions. Upon completion of training, it achieved state-of-the-art performance among published results, demonstrating the effectiveness of our approach for SER in real-world conditions.

摘要

自然场景下的语音情感识别（SER）由于内在变异性、多样化的录音条件以及类别不平衡等问题，仍面临挑战。作为聚焦这些复杂性的Interspeech自然场景SER挑战赛参赛者，我们提出Abhinaya系统，该系统整合了基于语音、文本及语音-文本的模型。我们的方法通过微调自监督语音大语言模型（SLLM）获取语音表征，利用大语言模型（LLM）提取文本上下文，并采用SLLM进行语音-文本建模以捕捉细微情感线索。为应对类别不平衡，我们应用定制化损失函数并通过多数投票生成分类决策。尽管其中一个模型未完全训练，Abhinaya系统仍在166份提交中排名第4。在完成训练后，该系统在已发表成果中达到了最先进的性能，证明了我们提出的方法在真实场景SER任务中的有效性。

Large Language Model-Driven Distributed Integrated Multimodal Sensing and Semantic Communications

Abstract

arXiv:2505.18194v1 Announce Type: cross Abstract: Traditional single-modal sensing systems, based solely on either radio frequency (RF) or visual data, struggle to cope with the demands of complex and dynamic environments. Furthermore, single-device systems are constrained by limited perspectives and insufficient spatial coverage, which impairs their effectiveness in urban or non-line-of-sight scenarios. To overcome these challenges, we propose a novel large language model (LLM)-driven distributed integrated multimodal sensing and semantic communication (LLM-DiSAC) framework. Specifically, our system consists of multiple collaborative sensing devices equipped with RF and camera modules, working together with an aggregation center to enhance sensing accuracy. First, on sensing devices, LLM-DiSAC develops an RF-vision fusion network (RVFN), which employs specialized feature extractors for RF and visual data, followed by a cross-attention module for effective multimodal integration. Second, an LLM-based semantic transmission network (LSTN) is proposed to enhance communication efficiency, where the LLM-based decoder leverages known channel parameters, such as transceiver distance and signal-to-noise ratio (SNR), to mitigate semantic distortion. Third, at the aggregation center, a transformer-based aggregation model (TRAM) with an adaptive aggregation attention mechanism is developed to fuse distributed features and enhance sensing accuracy. To preserve data privacy, a two-stage distributed learning strategy is introduced, allowing local model training at the device level and centralized aggregation model training using intermediate features. Finally, evaluations on a synthetic multi-view RF-visual dataset generated by the Genesis simulation engine show that LLM-DiSAC achieves good performance.

摘要

传统基于单一模态感知系统（仅依赖射频或视觉数据）难以应对复杂动态环境的需求。此外，单设备系统受限于视角狭窄和空间覆盖不足，在城区或非视距场景中效能受限。为突破这些限制，我们提出一种新型大语言模型驱动的分布式集成多模态感知与语义通信框架（LLM-DiSAC）。该系统由多个配备射频与摄像模块的协同感知设备构成，通过与汇聚中心协作提升感知精度。首先，在感知设备端，LLM-DiSAC开发了射频-视觉融合网络（RVFN），采用专用特征提取器处理射频与视觉数据，并通过交叉注意力模块实现高效多模态融合。其次，提出基于大语言模型的语义传输网络（LSTN）以提升通信效率，其中基于大语言模型的解码器利用收发距离、信噪比等已知信道参数来抑制语义失真。第三，在汇聚中心端开发了具有自适应聚合注意力机制的Transformer聚合模型（TRAM），用于融合分布式特征并提升感知精度。为保护数据隐私，采用两阶段分布式学习策略：在设备端进行本地模型训练，同时利用中间特征进行集中式聚合模型训练。最终，基于Genesis仿真引擎生成的合成多视角射频-视觉数据集验证表明，LLM-DiSAC实现了优越性能。

CoMet: Metaphor-Driven Covert Communication for Multi-Agent Language Games

Abstract

arXiv:2505.18218v1 Announce Type: cross Abstract: Metaphors are a crucial way for humans to express complex or subtle ideas by comparing one concept to another, often from a different domain. However, many large language models (LLMs) struggle to interpret and apply metaphors in multi-agent language games, hindering their ability to engage in covert communication and semantic evasion, which are crucial for strategic communication. To address this challenge, we introduce CoMet, a framework that enables LLM-based agents to engage in metaphor processing. CoMet combines a hypothesis-based metaphor reasoner with a metaphor generator that improves through self-reflection and knowledge integration. This enhances the agents' ability to interpret and apply metaphors, improving the strategic and nuanced quality of their interactions. We evaluate CoMet on two multi-agent language games - Undercover and Adversarial Taboo - which emphasize Covert Communication and Semantic Evasion. Experimental results demonstrate that CoMet significantly enhances the agents' ability to communicate strategically using metaphors.

摘要

隐喻是人类通过将一个概念与另一领域的概念相比较来表达复杂或微妙思想的重要手段。然而，许多大语言模型（LLM）在多智能体语言游戏中难以理解和运用隐喻，这阻碍了其进行隐蔽交流和语义规避的能力，而这些能力对策略性沟通至关重要。为解决这一挑战，我们提出了CoMet框架，使基于LLM的智能体能够进行隐喻处理。CoMet将基于假设的隐喻推理器与通过自我反思和知识整合改进的隐喻生成器相结合，从而增强了智能体解释和应用隐喻的能力，提升了其交互的策略性和微妙性。我们在两款侧重隐蔽交流与语义规避的多智能体语言游戏——《Undercover》和《Adversarial Taboo》上评估了CoMet。实验结果表明，CoMet显著提升了智能体使用隐喻进行策略性沟通的能力。

Do BERT-Like Bidirectional Models Still Perform Better on Text Classification in the Era of LLMs?

Abstract

arXiv:2505.18215v1 Announce Type: cross Abstract: The rapid adoption of LLMs has overshadowed the potential advantages of traditional BERT-like models in text classification. This study challenges the prevailing "LLM-centric" trend by systematically comparing three categories of methods, i.e., fine-tuning of BERT-like models, LLM internal state utilization, and zero-shot inference, across six high-difficulty datasets. Our findings reveal that BERT-like models often outperform LLMs. We further categorize datasets into three types, perform PCA and probing experiments, and identify task-specific model strengths: BERT-like models excel in pattern-driven tasks, while LLMs dominate those requiring deep semantics or world knowledge. Based on this, we propose TaMAS, a fine-grained task selection strategy, advocating for a nuanced, task-driven approach over a one-size-fits-all reliance on LLMs.

摘要

大型语言模型（LLM）的快速普及掩盖了传统BERT类模型在文本分类中的潜在优势。本研究通过系统比较三类方法（即BERT类模型微调、LLM内部状态利用和零样本推理）在六个高难度数据集上的表现，对当前"以LLM为中心"的主流趋势提出挑战。实验结果表明，BERT类模型往往优于LLMs。我们进一步将数据集划分为三种类型，进行主成分分析和探测实验，发现任务特异性模型优势：BERT类模型擅长模式驱动型任务，而LLMs在需要深度语义或世界知识的任务中表现更优。基于此，我们提出细粒度任务选择策略TaMAS，倡导根据具体任务特性选择模型，而非盲目依赖LLMs的"一刀切"方案。

IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis

Abstract

arXiv:2505.18223v1 Announce Type: cross Abstract: Large Language Models (LLMs) show promise as data analysis agents, but existing benchmarks overlook the iterative nature of the field, where experts' decisions evolve with deeper insight into the dataset. To address this, we introduce IDA-Bench, a novel benchmark evaluating LLM agents in multi-round interactive scenarios. Derived from complex Kaggle notebooks, tasks are presented as sequential natural language instructions by an LLM-simulated user. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Initial results show that even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on fewer than 50% of the tasks, highlighting limitations not evident in single-turn tests. This work underscores the need to improve LLMs' multi-round capabilities for building more reliable data analysis agents, highlighting the necessity of achieving a balance between instruction following and reasoning.

摘要

大语言模型（LLMs）作为数据分析代理展现出潜力，但现有基准测试忽视了该领域的迭代特性——专家的决策会随着对数据集理解的深入而演变。为此，我们提出IDA-Bench，这是一个评估LLM代理在多轮交互场景中表现的新型基准。该基准源自复杂的Kaggle笔记本，任务以LLM模拟用户发出的序列化自然语言指令形式呈现。代理性能通过将其最终数值输出与人工基准进行对比来评判。初步结果显示，即使是Claude-3.7-thinking等最先进的编码代理，其任务成功率也低于50%，这揭示了单轮测试中无法体现的局限性。本研究强调需提升LLMs的多轮交互能力以构建更可靠的数据分析代理，同时指出必须在指令遵循与推理能力之间取得平衡。

Navigating Pitfalls: Evaluating LLMs in Machine Learning Programming Education

Abstract

arXiv:2505.18220v1 Announce Type: cross Abstract: The rapid advancement of Large Language Models (LLMs) has opened new avenues in education. This study examines the use of LLMs in supporting learning in machine learning education; in particular, it focuses on the ability of LLMs to identify common errors of practice (pitfalls) in machine learning code, and their ability to provide feedback that can guide learning. Using a portfolio of code samples, we consider four different LLMs: one closed model and three open models. Whilst the most basic pitfalls are readily identified by all models, many common pitfalls are not. They particularly struggle to identify pitfalls in the early stages of the ML pipeline, especially those which can lead to information leaks, a major source of failure within applied ML projects. They also exhibit limited success at identifying pitfalls around model selection, which is a concept that students often struggle with when first transitioning from theory to practice. This questions the use of current LLMs to support machine learning education, and also raises important questions about their use by novice practitioners. Nevertheless, when LLMs successfully identify pitfalls in code, they do provide feedback that includes advice on how to proceed, emphasising their potential role in guiding learners. We also compare the capability of closed and open LLM models, and find that the gap is relatively small given the large difference in model sizes. This presents an opportunity to deploy, and potentially customise, smaller more efficient LLM models within education, avoiding risks around cost and data sharing associated with commercial models.

摘要

大型语言模型（LLMs）的快速发展为教育领域开辟了新途径。本研究探讨了LLMs在机器学习教育中支持学习的应用，重点关注其识别机器学习代码中常见实践错误（陷阱）的能力，以及提供学习指导反馈的能力。通过一组代码样本组合，我们评估了四种不同LLMs：一种闭源模型和三种开源模型。虽然所有模型都能轻松识别最基本的陷阱，但对许多常见陷阱却无法识别。这些模型尤其难以识别机器学习流程早期阶段的陷阱，特别是可能导致信息泄露的陷阱——这是应用机器学习项目失败的主要根源。此外，在模型选择相关的陷阱识别上，这些模型表现有限，而该概念正是学生从理论转向实践时经常遇到的难点。这对当前LLMs支持机器学习教育的适用性提出了质疑，同时也引发了关于新手从业者使用这些模型的重要问题。然而，当LLMs成功识别代码中的陷阱时，其反馈确实包含后续操作建议，凸显了其在学习引导方面的潜在作用。我们还比较了闭源与开源LLM模型的能力，发现尽管模型规模差异显著，但性能差距相对较小。这为在教育领域部署（并可能定制）更高效的小型LLM模型提供了机遇，同时规避了商业模型在成本和数据共享方面的风险。

Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality

Abstract

arXiv:2505.18227v1 Announce Type: cross Abstract: In Transformer architectures, tokens (discrete units derived from raw data) are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations while preserving the input's essential information. Due to the quadratic computational complexity of transformer self-attention mechanisms, token reduction has primarily been used as an efficiency strategy. This is especially true in single vision and language domains, where it helps balance computational costs, memory usage, and inference latency. Despite these advances, this paper argues that token reduction should transcend its traditional efficiency-oriented role in the era of large generative models. Instead, we position it as a fundamental principle in generative modeling, critically influencing both model architecture and broader applications. Specifically, we contend that across vision, language, and multimodal systems, token reduction can: (i) facilitate deeper multimodal integration and alignment, (ii) mitigate "overthinking" and hallucinations, (iii) maintain coherence over long inputs, and (iv) enhance training stability, etc. We reframe token reduction as more than an efficiency measure. By doing so, we outline promising future directions, including algorithm design, reinforcement learning-guided token reduction, token optimization for in-context learning, and broader ML and scientific domains. We highlight its potential to drive new model architectures and learning strategies that improve robustness, increase interpretability, and better align with the objectives of generative modeling.

摘要

在Transformer架构中，通过将输入分割为固定长度块来形成标记（即从原始数据中提取的离散单元）。每个标记随后被映射为嵌入表示，从而在保留输入核心信息的同时实现并行注意力计算。由于Transformer自注意力机制具有二次计算复杂度，标记缩减技术主要被用作效率优化策略，这在单模态视觉和语言领域尤为明显，因其有助于平衡计算成本、内存占用和推理延迟。尽管已有这些进展，本文主张在大型生成模型时代，标记缩减应当超越其传统效率导向的角色。我们将其重新定位为生成建模的基础原则，认为其对模型架构和更广泛的应用具有关键影响。具体而言，我们论证了在视觉、语言和多模态系统中，标记缩减能够：（i）促进更深层次的多模态融合与对齐；（ii）缓解'过度思考'和幻觉现象；（iii）保持长输入序列的连贯性；（iv）提升训练稳定性等。我们将标记缩减重新定义为超越效率优化的核心要素，并据此勾勒出未来研究方向，包括算法设计、强化学习引导的标记缩减、上下文学习中的标记优化，以及更广泛的机器学习和科学领域应用。我们强调该技术有望推动新型模型架构和学习策略的发展，从而提升模型鲁棒性、增强可解释性，并更好地契合生成建模的目标。

NSNQuant: A Double Normalization Approach for Calibration-Free Low-Bit Vector Quantization of KV Cache

Abstract

arXiv:2505.18231v1 Announce Type: cross Abstract: Large Language Model (LLM) inference is typically memory-intensive, especially when processing large batch sizes and long sequences, due to the large size of the key-value (KV) cache. Vector Quantization (VQ) has recently been adopted to alleviate this issue, but we find that the existing approach is susceptible to distribution shift due to its reliance on calibration datasets. To address this limitation, we introduce NSNQuant, a calibration-free Vector Quantization (VQ) technique designed for low-bit compression of the KV cache. By applying a three-step transformation, 1) a token-wise normalization (Normalize), 2) a channel-wise centering (Shift), and 3) a second token-wise normalization (Normalize), together with a Hadamard transform, NSNQuant effectively aligns the token distribution with the standard normal distribution. This alignment enables robust, calibration-free vector quantization using a single reusable codebook. Extensive experiments show that NSNQuant consistently outperforms prior methods in both 1-bit and 2-bit settings, offering strong generalization and up to 3× throughput gain over full-precision baselines.

摘要

大语言模型（LLM）推理过程通常具有较高的内存需求，尤其是在处理大批量数据和长序列时，键值（KV）缓存的大容量是主要原因。向量量化（VQ）技术近期被用于缓解这一问题，但我们发现现有方法因依赖校准数据集而易受分布偏移影响。为克服这一局限，本文提出NSNQuant，一种无需校准的向量量化技术，专为KV缓存的低位压缩设计。该方法通过三步变换（1）词元级归一化（Normalize）、（2）通道级中心化（Shift）、（3）二次词元级归一化（Normalize）结合Hadamard变换，将词元分布有效对齐标准正态分布。这种对齐方式实现了基于单一可复用码本的稳健、免校准向量量化。大量实验表明，NSNQuant在1比特和2比特设置下均优于现有方法，展现出强泛化能力，相比全精度基线最高可获得3倍吞吐量提升。
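
The Normalize-Shift-Normalize sequence can be sketched in plain Python. Several details here are assumptions rather than facts from the abstract: "normalization" is taken to mean scaling each token to unit L2 norm, the Hadamard transform and the codebook lookup are omitted, and the small `kv` matrix is made-up data.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows are tokens, columns are channels


def normalize_tokens(x: Matrix) -> Matrix:
    """Scale each token (row) to unit L2 norm; all-zero rows are left as-is."""
    out = []
    for row in x:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        out.append([v / norm for v in row])
    return out


def shift_channels(x: Matrix) -> Matrix:
    """Subtract the per-channel (column) mean taken over all tokens."""
    n = len(x)
    means = [sum(col) / n for col in zip(*x)]
    return [[v - m for v, m in zip(row, means)] for row in x]


def nsn(x: Matrix) -> Matrix:
    """Normalize, then Shift, then Normalize again."""
    return normalize_tokens(shift_channels(normalize_tokens(x)))


# Toy KV slice: three tokens with two channels each.
kv = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
out = nsn(kv)
for row in out:
    print([round(v, 4) for v in row])
```

After the final step every token has unit norm regardless of its original scale, which is the property that lets one fixed codebook be reused across inputs without calibration data.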

Taming LLMs with Negative Samples: A Reference-Free Framework to Evaluate Presentation Content with Actionable Feedback

Abstract

arXiv:2505.18240v1 Announce Type: cross Abstract: Automatically generating presentation slides is an important problem in the era of generative AI. This paper focuses on evaluating multimodal content in presentation slides that can effectively summarize a document and convey concepts to a broad audience. We introduce a benchmark dataset, RefSlides, consisting of human-made high-quality presentations that span various topics. Next, we propose a set of metrics to characterize different intrinsic properties of the content of a presentation and present REFLEX, an evaluation approach that generates scores and actionable feedback for these metrics. We achieve this by generating negative presentation samples with different degrees of metric-specific perturbations and using them to fine-tune LLMs. This reference-free evaluation technique does not require ground truth presentations during inference. Our extensive automated and human experiments demonstrate that our evaluation approach outperforms classical heuristic-based and state-of-the-art large language model-based evaluations in generating scores and explanations.

摘要

在生成式人工智能时代，自动生成演示文稿幻灯片是一个重要课题。本文重点评估能够有效总结文档内容并向广泛受众传递概念的多模态演示文稿内容。我们引入了一个基准数据集RefSlides，该数据集包含涵盖多个主题的人工制作高质量演示文稿。接着，我们提出一组用于表征演示文稿内容不同内在特性的指标，并提出了REFLEX评估方法——该方法能针对这些指标生成评分和可操作的反馈。我们通过生成具有不同程度指标特异性扰动的负面演示样本，并利用这些样本来微调大语言模型，从而实现这一目标。这种无参考评估技术在推理过程中不需要真实演示文稿作为基准。大量自动化及人工实验表明，我们的评估方法在生成评分和解释方面优于传统的基于启发式方法和最先进的大语言模型评估方法。

The Origins of Representation Manifolds in Large Language Models

Abstract

arXiv:2505.18235v1 Announce Type: cross Abstract: There is a large ongoing scientific effort in mechanistic interpretability to map embeddings and internal representations of AI systems into human-understandable concepts. A key element of this effort is the linear representation hypothesis, which posits that neural representations are sparse linear combinations of "almost-orthogonal" direction vectors, reflecting the presence or absence of different features. This model underpins the use of sparse autoencoders to recover features from representations. Moving towards a fuller model of features, in which neural representations could encode not just the presence but also a potentially continuous and multidimensional value for a feature, has been a subject of intense recent discourse. We describe why and how a feature might be represented as a manifold, demonstrating in particular that cosine similarity in representation space may encode the intrinsic geometry of a feature through shortest, on-manifold paths, potentially answering the question of how distance in representation space and relatedness in concept space could be connected. The critical assumptions and predictions of the theory are validated on text embeddings and token activations of large language models.

摘要

机制可解释性研究领域正致力于将人工智能系统的嵌入和内部表征映射为人类可理解的概念，其中线性表征假说是该研究的核心要素。该假说认为神经表征是由"近似正交"的方向向量构成的稀疏线性组合，反映不同特征的存在与否。这一模型支撑了使用稀疏自编码器从表征中恢复特征的方法。近期学界热议的焦点是构建更完整的特征模型，使神经表征不仅能编码特征的存在性，还能表达特征的潜在连续多维取值。本文阐述了特征为何及如何被表征为流形，特别论证了表征空间中的余弦相似性可能通过流形上的最短路径编码特征的内在几何结构，这或许能解释表征空间距离与概念空间关联性之间的联系。该理论的关键假设和预测在大型语言模型的文本嵌入与标记激活上得到了验证。

ELDeR: Getting Efficient LLMs through Data-Driven Regularized Layer-wise Pruning

Abstract

arXiv:2505.18232v1 Announce Type: cross Abstract: The deployment of Large language models (LLMs) in many fields is largely hindered by their high computational and memory costs. Recent studies suggest that LLMs exhibit sparsity, which can be used for pruning. Previous pruning methods typically follow a prune-then-finetune paradigm. Since the pruned parts still contain valuable information, statically removing them without updating the remaining parameters often results in irreversible performance degradation, requiring costly recovery fine-tuning (RFT) to maintain performance. To address this, we propose a novel paradigm: first apply regularization, then prune. Based on this paradigm, we propose ELDeR: Getting Efficient LLMs through Data-Driven Regularized Layer-wise Pruning. We multiply the output of each transformer layer by an initial weight, then we iteratively learn the weights of each transformer layer by using a small amount of data in a simple way. After that, we apply regularization to the difference between the output and input of the layers with smaller weights, forcing the information to be transferred to the remaining layers. Compared with direct pruning, ELDeR reduces the information loss caused by direct parameter removal, thus better preserving the model's language modeling ability. Experimental results show that ELDeR achieves superior performance compared with powerful layer-wise structured pruning methods, while greatly reducing RFT computational costs. Since ELDeR is a layer-wise pruning method, its end-to-end acceleration effect is obvious, making it a promising technique for efficient LLMs.

摘要

大型语言模型（LLMs）在许多领域的应用因其高昂的计算和内存成本而受到严重制约。近期研究表明，LLMs具有稀疏性特征，这一特性可用于模型剪枝。传统剪枝方法通常遵循"先剪枝后微调"的范式。由于被剪枝部分仍包含有价值信息，静态移除这些参数而不更新剩余参数往往会导致不可逆的性能下降，需要昂贵的恢复性微调（RFT）来维持性能。为解决这一问题，我们提出了一种新范式：先进行正则化处理，再实施剪枝。基于此范式，我们提出了ELDeR：通过数据驱动的正则化分层剪枝实现高效LLMs。该方法首先为每个Transformer层的输出乘以初始权重，随后通过少量数据以简单方式迭代学习各层的权重参数。之后对权重较小层的输入输出差异施加正则化约束，迫使信息转移至保留层。与直接剪枝相比，ELDeR显著降低了参数直接移除造成的信息损失，从而更好地保持了模型的语言建模能力。实验结果表明，相较于强大的分层结构化剪枝方法，ELDeR在取得更优性能的同时大幅降低了RFT计算成本。由于ELDeR采用分层剪枝策略，其端到端加速效果显著，为构建高效LLMs提供了极具前景的技术方案。
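The regularize-then-prune idea in the ELDeR abstract can be illustrated with a toy calculation. The sketch below is not the authors' implementation: it assumes the per-layer weights have already been learned, picks the layers with the smallest weights, and uses a squared L2 penalty on the gap between each selected layer's output and input. The penalty form, the function names, and the numbers are all hypothetical.

```python
def select_prunable_layers(layer_weights, num_to_prune):
    """Return the indices of the layers with the smallest learned weights."""
    ranked = sorted(range(len(layer_weights)), key=lambda i: layer_weights[i])
    return sorted(ranked[:num_to_prune])


def residual_penalty(layer_inputs, layer_outputs, prunable):
    """Sum of squared (output - input) differences over the prunable layers.

    Assumption: a squared L2 penalty; the abstract does not state the exact form.
    """
    total = 0.0
    for i in prunable:
        total += sum((o - x) ** 2 for x, o in zip(layer_inputs[i], layer_outputs[i]))
    return total


# Hypothetical learned weights and activations for a 4-layer toy model.
weights = [0.9, 0.2, 0.7, 0.1]
inputs = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [2.0, 3.0]]
outputs = [[1.5, 2.5], [2.0, 2.0], [2.0, 3.0], [2.5, 3.0]]

prunable = select_prunable_layers(weights, num_to_prune=2)
penalty = residual_penalty(inputs, outputs, prunable)
print(prunable, penalty)
```

Under this reading, driving the penalty toward zero makes each selected layer behave like an identity map, so removing it later should change the model's output very little.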

Think or Not? Exploring Thinking Efficiency in Large Reasoning Models via an Information-Theoretic Lens

Abstract

arXiv:2505.18237v1 Announce Type: cross Abstract: The recent rise of Large Reasoning Models (LRMs) has significantly improved multi-step reasoning performance, but often at the cost of generating excessively long reasoning chains. This paper revisits the efficiency of such reasoning processes through an information-theoretic lens, revealing a fundamental trade-off between reasoning length and semantic efficiency. We propose two metrics, InfoBias and InfoGain, to quantify divergence from ideal reasoning paths and stepwise information contribution, respectively. Empirical analyses show that longer reasoning chains tend to exhibit higher information bias and diminishing information gain, especially for incorrect answers. Motivated by these findings, we introduce an entropy-based Adaptive Think strategy that dynamically halts reasoning once confidence is sufficiently high, improving efficiency while maintaining competitive accuracy. Compared to the Vanilla Think approach (default mode), our strategy yields a 1.10% improvement in average accuracy and a 50.80% reduction in token usage on QwQ-32B across six benchmark tasks spanning diverse reasoning types and difficulty levels, demonstrating superior efficiency and reasoning performance. These results underscore the promise of entropy-based methods for enhancing both accuracy and cost-efficiency in large language model deployment.

摘要

近年来，大型推理模型（LRMs）的兴起显著提升了多步推理性能，但往往伴随生成冗余推理链的问题。本文通过信息论视角重新审视推理过程的效率，揭示了推理长度与语义效率之间的本质权衡。我们提出InfoBias（信息偏差）和InfoGain（信息增益）两个量化指标，分别用于衡量推理路径与理想状态的偏离程度及步骤间信息贡献。实证分析表明，过长的推理链通常伴随更高信息偏差和递减的信息增益，错误答案中这种现象尤为显著。基于此发现，我们提出基于信息熵的自适应推理策略（Adaptive Think），在置信度达标时动态终止推理过程。相比基线方法（Vanilla Think），该策略在涵盖六类推理任务和难度等级的QwQ-32B基准测试中实现平均准确率提升1.10%，同时减少50.80%的token消耗，展现出卓越的效率和推理性能。这些结果印证了基于信息熵的方法在提升大语言模型部署的准确性与成本效益方面的潜力。
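As a rough illustration of entropy-based halting, the sketch below stops as soon as the Shannon entropy of the model's distribution over candidate answers falls to or below a threshold. It is not the paper's Adaptive Think procedure: the per-step distributions, the threshold value, and the function names are hypothetical, and the abstract does not say how confidence is actually measured.

```python
import math


def shannon_entropy(probs):
    """Entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def adaptive_think(step_distributions, threshold):
    """Return how many reasoning steps are consumed before halting.

    step_distributions holds one distribution over candidate answers per
    reasoning step (hypothetical input). Halts at the first step whose
    entropy is at or below the threshold; otherwise uses every step.
    """
    for step, probs in enumerate(step_distributions, start=1):
        if shannon_entropy(probs) <= threshold:
            return step
    return len(step_distributions)


# Hypothetical answer distributions that sharpen as reasoning proceeds.
distributions = [
    [0.25, 0.25, 0.25, 0.25],
    [0.50, 0.30, 0.10, 0.10],
    [0.90, 0.05, 0.03, 0.02],
    [0.97, 0.01, 0.01, 0.01],
]
steps_used = adaptive_think(distributions, threshold=0.7)
print(steps_used)
```

With these made-up numbers the loop halts one step early. A lower threshold trades more reasoning tokens for higher confidence before stopping.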

Multi-Scale Probabilistic Generation Theory: A Hierarchical Framework for Interpreting Large Language Models

Abstract

arXiv:2505.18244v1 Announce Type: cross Abstract: Large Transformer-based language models achieve remarkable performance but remain opaque in how they plan, structure, and realize text. We introduce Multi-Scale Probabilistic Generation Theory (MSPGT), a hierarchical framework that factorizes generation into three semantic scales (global context, intermediate structure, and local word choices) and aligns each scale with specific layer ranges in Transformer architectures. To identify scale boundaries, we propose two complementary metrics: attention span thresholds and inter-layer mutual information peaks. Across four representative models (GPT-2, BERT, RoBERTa, and T5), these metrics yield stable local/intermediate/global partitions, corroborated by probing tasks and causal interventions. We find that decoder-only models allocate more layers to intermediate and global processing while encoder-only models emphasize local feature extraction. Through targeted interventions, we demonstrate that local-scale manipulations primarily influence lexical diversity, intermediate-scale modifications affect sentence structure and length, and global-scale perturbations impact discourse coherence, all with statistically significant effects. MSPGT thus offers a unified, architecture-agnostic method for interpreting, diagnosing, and controlling large language models, bridging the gap between mechanistic interpretability and emergent capabilities.

摘要

基于Transformer的大型语言模型表现出卓越性能，但其文本生成过程中的规划、结构与实现机制仍不透明。本研究提出多尺度概率生成理论（MSPGT），该分层框架将生成过程分解为三个语义尺度：全局语境、中间结构和局部词汇选择，并将每个尺度与Transformer架构的特定层级范围对应。为确定尺度边界，我们提出两个互补指标：注意力跨度阈值和层间互信息峰值。在四种代表性模型（GPT-2、BERT、RoBERTa和T5）上的实验表明，这些指标能稳定划分局部/中间/全局层级分区，该结果通过探测任务和因果干预得到验证。研究发现：纯解码器模型将更多层级分配给中间和全局处理，而纯编码器模型更侧重局部特征提取。通过定向干预实验证实：局部尺度调控主要影响词汇多样性，中间尺度修改改变句子结构和长度，全局尺度扰动则影响语篇连贯性——所有效应均具有统计显著性。MSPGT理论由此提供了一种架构无关的统一方法，可用于大型语言模型的解释、诊断与控制，在机制可解释性与涌现能力之间架设了桥梁。

MetaGen Blended RAG: Higher Accuracy for Domain-Specific Q&A Without Fine-Tuning

Abstract

arXiv:2505.18247v1 Announce Type: cross Abstract: Despite the widespread exploration of Retrieval-Augmented Generation (RAG), its deployment in enterprises for domain-specific datasets remains limited due to poor answer accuracy. These corpora, often shielded behind firewalls in private enterprise knowledge bases, contain complex, domain-specific terminology rarely seen by LLMs during pre-training and exhibit significant semantic variability across domains (such as networking, military, or legal) or even within a single domain like medicine, which results in poor context precision for RAG systems. Currently, in such situations, fine-tuning or RAG with fine-tuning is attempted, but these approaches are slow and expensive, and they generalize poorly as new domain-specific data emerges. We propose an approach for Enterprise Search that focuses on enhancing the retriever for a domain-specific corpus through hybrid query indexes and metadata enrichment. This 'MetaGen Blended RAG' method constructs a metadata generation pipeline using key concepts, topics, and acronyms, and then creates a metadata-enriched hybrid index with boosted search queries. This approach avoids overfitting and generalizes effectively across domains. On the PubMedQA benchmark for the biomedical domain, the proposed method achieves 82% retrieval accuracy and 77% RAG accuracy, surpassing all previous RAG accuracy results without fine-tuning and setting a new benchmark for zero-shot results while outperforming much larger models like GPT3.5. The results are even comparable to the best fine-tuned models on this dataset, and we further demonstrate the robustness and scalability of the approach by evaluating it on other Q&A datasets such as SQuAD and NQ.

摘要

尽管检索增强生成（RAG）技术已被广泛探索，但由于答案准确性不足，其在企业领域特定数据集中的部署仍受限。这些通常位于企业私有知识库防火墙后的语料库具有复杂且领域专用的术语（如网络、军事或法律等领域），这些术语在LLMs预训练阶段极少出现；同时不同领域（甚至医学等单一领域内部）存在显著的语义差异性，导致RAG系统的上下文精确度低下。当前此类场景通常尝试微调或"微调+RAG"方案，但这些方法存在速度慢、成本高且随新增领域数据出现时泛化能力不足的缺陷。我们提出一种企业搜索解决方案，通过混合查询索引与元数据增强来优化领域专用语料库的检索器。该"MetaGen混合RAG"方法构建了基于关键概念、主题及缩略词的元数据生成管道，继而创建具有增强搜索查询的元数据混合索引。该方法避免了过拟合问题并能有效实现跨领域泛化。在生物医学领域的PubMedQA基准测试中，所提方法取得82%的检索准确率和77%的RAG准确率，超越所有无需微调的既往RAG精度结果，为零样本效果树立了新基准，同时优于GPT3.5等更大规模模型。其效果甚至可媲美该数据集上最佳微调模型，我们进一步通过SQuAD、NQ等问答数据集验证了该方法的鲁棒性与可扩展性。

Abstract

arXiv:2505.18322v1 Announce Type: cross Abstract: LLMs have been demonstrated to align with the values of Western or North American cultures. Prior work predominantly showed this effect through leveraging surveys that directly ask (originally people and now also LLMs) about their values. However, it is hard to believe that LLMs would consistently apply those values in real-world scenarios. To address that, we take a bottom-up approach, asking LLMs to reason about cultural norms in narratives from different cultures. We find that GPT-4 tends to generate norms that, while not necessarily incorrect, are significantly less culture-specific. In addition, while it avoids overtly generating stereotypes, the stereotypical representations of certain cultures are merely hidden rather than suppressed in the model, and such stereotypes can be easily recovered. Addressing these challenges is a crucial step towards developing LLMs that fairly serve their diverse user base.

摘要

已有研究表明，大型语言模型（LLMs）与西方或北美文化价值观保持一致。先前工作主要通过直接询问（最初是人类，现在也包括LLMs）其价值观的调查来证明这一效应。然而，很难相信LLMs会在现实场景中始终如一地应用这些价值观。为此，我们采用自下而上的方法，要求LLMs对不同文化叙事中的文化规范进行推理。我们发现，GPT-4倾向于生成的规范虽然不一定错误，但显著缺乏文化特异性。此外，尽管它避免公然生成刻板印象，但某些文化的刻板表征在模型中只是被隐藏而非消除，这类刻板印象很容易被恢复。解决这些挑战是开发能够公平服务多元化用户群体的LLMs的关键一步。

TAGS: A Test-Time Generalist-Specialist Framework with Retrieval-Augmented Reasoning and Verification

Abstract

arXiv:2505.18283v1 Announce Type: cross Abstract: Recent advances such as Chain-of-Thought prompting have significantly improved large language models (LLMs) in zero-shot medical reasoning. However, prompting-based methods often remain shallow and unstable, while fine-tuned medical LLMs suffer from poor generalization under distribution shifts and limited adaptability to unseen clinical scenarios. To address these limitations, we present TAGS, a test-time framework that combines a broadly capable generalist with a domain-specific specialist to offer complementary perspectives without any model fine-tuning or parameter updates. To support this generalist-specialist reasoning process, we introduce two auxiliary modules: a hierarchical retrieval mechanism that provides multi-scale exemplars by selecting examples based on both semantic and rationale-level similarity, and a reliability scorer that evaluates reasoning consistency to guide final answer aggregation. TAGS achieves strong performance across nine MedQA benchmarks, boosting GPT-4o accuracy by 13.8%, DeepSeek-R1 by 16.8%, and improving a vanilla 7B model from 14.1% to 23.9%. These results surpass several fine-tuned medical LLMs, without any parameter updates. The code will be available at https://github.com/JianghaoWu/TAGS.

摘要

近期诸如思维链提示等进展显著提升了大型语言模型（LLMs）在零样本医疗推理任务中的表现。然而，基于提示的方法往往存在浅层推理和不稳定的问题，而经过微调的医疗LLMs则在分布偏移下泛化能力不足，且对未见临床场景的适应性有限。为应对这些局限性，我们提出TAGS框架——一种测试时方法，通过将通用基础模型与领域专家模型相结合，在不进行任何模型微调或参数更新的情况下提供互补视角。为支持这种通用-专家协同推理机制，我们引入两个辅助模块：分层检索机制（通过语义和原理级相似性筛选示例，提供多尺度参考样本）和可靠性评分器（评估推理一致性以指导最终答案聚合）。TAGS在九项MedQA基准测试中表现优异，将GPT-4o准确率提升13.8%，DeepSeek-R1提升16.8%，并将基础7B模型性能从14.1%提升至23.9%。这些结果超越了多个经过微调的医疗LLMs，且无需任何参数更新。代码将在https://github.com/JianghaoWu/TAGS发布。

Abstract

arXiv:2505.18341v1 Announce Type: cross Abstract: Training and evaluating autonomous driving algorithms requires a diverse range of scenarios. However, most available datasets predominantly consist of normal driving behaviors demonstrated by human drivers, resulting in a limited number of safety-critical cases. This imbalance, often referred to as a long-tail distribution, restricts the ability of driving algorithms to learn from crucial scenarios involving risk or failure, scenarios that are essential for humans to develop driving skills efficiently. To generate such scenarios, we utilize Multi-modal Large Language Models to convert crash reports of accidents into a structured scenario format, which can be directly executed within simulations. Specifically, we introduce CrashAgent, a multi-agent framework designed to interpret multi-modal real-world traffic crash reports for the generation of both road layouts and the behaviors of the ego vehicle and surrounding traffic participants. We comprehensively evaluate the generated crash scenarios from multiple perspectives, including the accuracy of layout reconstruction, collision rate, and diversity. The resulting high-quality and large-scale crash dataset will be publicly available to support the development of safe driving algorithms in handling safety-critical situations.

摘要

训练和评估自动驾驶算法需要多样化的场景。然而，现有数据集主要由人类驾驶员展示的正常驾驶行为构成，导致安全关键案例数量有限。这种通常被称为长尾分布的数据失衡问题，限制了驾驶算法从涉及风险或故障的关键场景中学习的能力，而这些场景对人类高效掌握驾驶技能至关重要。为生成此类场景，我们利用多模态大语言模型将交通事故报告转化为结构化场景格式，使其可直接在仿真环境中执行。具体而言，我们提出CrashAgent——一个多智能体框架，旨在解析多模态真实世界交通事故报告，以生成道路布局、自车及周围交通参与者的行为。我们从布局重建准确性、碰撞率和多样性等多维度对生成的碰撞场景进行全面评估。最终形成的高质量大规模碰撞数据集将公开提供，以支持安全驾驶算法处理关键安全场景的研发工作。

PerMedCQA: Benchmarking Large Language Models on Medical Consumer Question Answering in Persian Language

Abstract

arXiv:2505.18331v1 Announce Type: cross Abstract: Medical consumer question answering (CQA) is crucial for empowering patients by providing personalized and reliable health information. Despite recent advances in large language models (LLMs) for medical QA, consumer-oriented and multilingual resources, particularly in low-resource languages like Persian, remain sparse. To bridge this gap, we present PerMedCQA, the first Persian-language benchmark for evaluating LLMs on real-world, consumer-generated medical questions. Curated from a large medical QA forum, PerMedCQA contains 68,138 question-answer pairs, refined through careful data cleaning from an initial set of 87,780 raw entries. We evaluate several state-of-the-art multilingual and instruction-tuned LLMs, utilizing MedJudge, a novel rubric-based evaluation framework driven by an LLM grader, validated against expert human annotators. Our results highlight key challenges in multilingual medical QA and provide valuable insights for developing more accurate and context-aware medical assistance systems. The data is publicly available on https://huggingface.co/datasets/NaghmehAI/PerMedCQA

摘要

医疗消费者问答（CQA）通过提供个性化且可靠的健康信息，对增强患者自主权至关重要。尽管目前基于大语言模型（LLM）的医疗问答系统取得进展，但面向消费者且支持多语言的资源——尤其是波斯语等低资源语言——仍然匮乏。为填补这一空白，我们推出首个波斯语基准测试集PerMedCQA，用于评估LLM处理真实世界消费者医疗问题的能力。该数据集从大型医疗问答论坛中精选而成，包含68,138个问答对，是从87,780条原始条目经过严格数据清洗后获得的。我们采用基于量规的新型评估框架MedJudge（由LLM评分器驱动并经专家人工标注验证），对多个最先进的多语言及指令微调LLM进行了评估。研究结果揭示了多语言医疗问答中的关键挑战，并为开发更精准、更具情境感知的医疗辅助系统提供了重要见解。数据已公开于https://huggingface.co/datasets/NaghmehAI/PerMedCQA。

Task Specific Pruning with LLM-Sieve: How Many Parameters Does Your Task Really Need?

Abstract

arXiv:2505.18350v1 Announce Type: cross Abstract: As Large Language Models (LLMs) are increasingly being adopted for narrow tasks - such as medical question answering or sentiment analysis - and deployed in resource-constrained settings, a key question arises: how many parameters does a task actually need? In this work, we present LLM-Sieve, the first comprehensive framework for task-specific pruning of LLMs that achieves 20-75% parameter reduction with only 1-5% accuracy degradation across diverse domains. Unlike prior methods that apply uniform pruning or rely on low-rank approximations of weight matrices or inputs in isolation, LLM-Sieve (i) learns task-aware joint projections to better approximate output behavior, and (ii) employs a Genetic Algorithm to discover differentiated pruning levels for each matrix. LLM-Sieve is fully compatible with LoRA fine-tuning and quantization, and uniquely demonstrates strong generalization across datasets within the same task domain. Together, these results establish a practical and robust mechanism to generate smaller performant task-specific models.

摘要

随着大语言模型（LLM）日益应用于特定任务（如医疗问答或情感分析）并部署于资源受限环境，一个关键问题随之产生：特定任务实际需要多少参数量？本研究提出LLM-Sieve框架，这是首个面向任务定制的LLM剪枝综合方案，能在多样化领域实现20-75%的参数削减，同时仅产生1-5%的精度损失。与传统采用均匀剪枝或单独依赖权重矩阵/输入低秩近似的方法不同，LLM-Sieve具有两大创新：(i) 通过任务感知的联合投影学习更精准逼近输出行为；(ii) 采用遗传算法为每个矩阵发现差异化剪枝强度。该框架完全兼容LoRA微调与量化技术，并独特展现出同任务领域内跨数据集的强泛化能力。这些成果共同构建了一个实用且鲁棒的机制，可生成更小规模的高性能任务专用模型。

A Critical Evaluation of Defenses against Prompt Injection Attacks

Abstract

arXiv:2505.18333v1 Announce Type: cross Abstract: Large Language Models (LLMs) are vulnerable to prompt injection attacks, and several defenses have recently been proposed, often claiming to mitigate these attacks successfully. However, we argue that existing studies lack a principled approach to evaluating these defenses. In this paper, we argue the need to assess defenses across two critical dimensions: (1) effectiveness, measured against both existing and adaptive prompt injection attacks involving diverse target and injected prompts, and (2) general-purpose utility, ensuring that the defense does not compromise the foundational capabilities of the LLM. Our critical evaluation reveals that prior studies have not followed such a comprehensive evaluation methodology. When assessed using this principled approach, we show that existing defenses are not as successful as previously reported. This work provides a foundation for evaluating future defenses and guiding their development. Our code and data are available at: https://github.com/PIEval123/PIEval.

摘要

大型语言模型（LLMs）易受提示注入攻击，近期已有若干防御方案被提出，且常宣称能有效缓解此类攻击。然而，我们认为现有研究缺乏评估这些防御措施的体系化方法。本文提出应从两个关键维度进行评估：（1）防御有效性，需针对现有及自适应的提示注入攻击进行测试，涵盖多样化目标提示与注入提示；（2）通用功能性，需确保防御机制不影响LLM的基础能力。批判性评估表明，先前研究均未遵循如此全面的评估方法。当采用本研究的体系化方法进行评估时，我们发现现有防御方案的实际效果远低于既有报道。本工作为未来防御方案的评估与开发提供了方法论基础。代码与数据详见：https://github.com/PIEval123/PIEval。

SchemaGraphSQL: Efficient Schema Linking with Pathfinding Graph Algorithms for Text-to-SQL on Large-Scale Databases

Abstract

arXiv:2505.18363v1 Announce Type: cross Abstract: Text-to-SQL systems translate natural language questions into executable SQL queries, and recent progress with large language models (LLMs) has driven substantial improvements in this task. Schema linking remains a critical component in Text-to-SQL systems, reducing prompt size for models with narrow context windows and sharpening model focus even when the entire schema fits. We present a zero-shot, training-free schema linking approach that first constructs a schema graph based on foreign key relations, then uses a single prompt to Gemini 2.5 Flash to extract source and destination tables from the user query, followed by applying classical path-finding algorithms and post-processing to identify the optimal sequence of tables and columns that should be joined, enabling the LLM to generate more accurate SQL queries. Despite being simple, cost-effective, and highly scalable, our method achieves state-of-the-art results on the BIRD benchmark, outperforming previous specialized, fine-tuned, and complex multi-step LLM-based approaches. We conduct detailed ablation studies to examine the precision-recall trade-off in our framework. Additionally, we evaluate the execution accuracy of our schema filtering method compared to other approaches across various model sizes.

摘要

文本到SQL系统将自然语言问题转化为可执行的SQL查询，而大型语言模型（LLM）的最新进展显著提升了该任务的性能。模式链接仍是文本到SQL系统的关键组件，它既能缩减上下文窗口有限模型的提示规模，也能在完整模式适配时增强模型专注力。我们提出一种零样本、无需训练的模式链接方法：首先基于外键关系构建模式图，随后使用单一提示通过Gemini 2.5 Flash从用户查询中提取源表和目标表，再应用经典路径查找算法及后处理技术确定最优的表列连接序列，从而使LLM能生成更精确的SQL查询。尽管该方法简单、成本效益高且具备高度可扩展性，但在BIRD基准测试中仍取得了最先进的成果，超越了先前基于LLM的专用、微调及复杂多步骤方法。我们通过详细消融实验研究了框架中的精确率-召回率权衡，并对比不同模型规模下模式过滤方法与其他方案在执行准确率上的表现。
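The graph step described in the SchemaGraphSQL abstract can be sketched with a breadth-first search over an undirected graph built from foreign keys. This is an illustrative sketch only: the schema, table names, and function names are hypothetical, the source and destination tables are hard-coded rather than extracted by an LLM, and the abstract does not say which path-finding algorithm or post-processing the authors use.

```python
from collections import deque


def build_schema_graph(foreign_keys):
    """Build an undirected adjacency map of tables linked by foreign keys."""
    graph = {}
    for child, parent in foreign_keys:
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def shortest_join_path(graph, source, destination):
    """Breadth-first search for the shortest table sequence, or None."""
    if source not in graph or destination not in graph:
        return None
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == destination:
            return path
        # Sort neighbors so the result is deterministic when paths tie.
        for neighbor in sorted(graph[path[-1]]):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical schema: (referencing table, referenced table).
FOREIGN_KEYS = [
    ("orders", "customers"),
    ("order_items", "orders"),
    ("order_items", "products"),
    ("products", "suppliers"),
]
graph = build_schema_graph(FOREIGN_KEYS)
join_path = shortest_join_path(graph, "customers", "suppliers")
print(join_path)
```

Only the tables on the returned path would then be passed to the SQL-generating model, which is how schema linking keeps the prompt small.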

The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLMs

Abstract

arXiv:2505.18356v1 Announce Type: cross Abstract: Large language models (LLMs) still struggle across tasks outside of high-resource languages. In this work, we investigate cross-lingual transfer to lower-resource languages where task-specific post-training data is scarce. Building on prior work, we first validate that the subsets of model parameters that matter most for mathematical reasoning and multilingual capabilities are distinctly non-overlapping. To exploit this implicit separability between task and target language parameterization, we develop and analyze numerous modular frameworks to improve the composition of the two during fine-tuning. These methods generally employ freezing parameters or post hoc model merging to assign math and language improvement to different key parts of the LLM. In the absence of in-language math data, we demonstrate that the modular approaches successfully improve upon baselines across three languages, four models, and two fine-tuning paradigms (full and LoRA). Furthermore, and somewhat surprisingly, we find that the most consistently successful modular method is to fine-tune separate language and math experts and merge them via Layer-Swapping. We offer possible explanations for this result via recent works on the linearity of task vectors. We further explain this by empirically showing that reverting less useful fine-tuning updates after training often outperforms freezing them from the start.

摘要

大语言模型（LLMs）在非高资源语言任务中仍面临困难。本研究探讨了在任务特定训练数据稀缺的低资源语言中的跨语言迁移。基于先前工作，我们首先验证了模型参数中对数学推理和多语言能力最关键的子集明显不重叠。为利用任务与目标语言参数化之间的这种隐式可分离性，我们开发并分析了多种模块化框架，以改进两者在微调期间的组合。这些方法通常采用参数冻结或事后模型融合技术，将数学与语言能力的提升分别分配给大语言模型的不同关键部分。在缺乏目标语言数学数据的情况下，我们证明这些模块化方法在三种语言、四种模型和两种微调范式（全参数与LoRA）中均成功超越了基线水平。此外，我们发现最稳定有效的模块化方法是通过层交换技术微调独立的语言与数学专家模型并进行融合，这一结果出人意料。我们结合近期关于任务向量线性特征的研究给出了可能的解释，并通过实证表明：训练后回退低效的微调更新，往往优于从一开始就冻结这些参数。

Next-token pretraining implies in-context learning

Abstract

arXiv:2505.18373v1 Announce Type: cross Abstract: We argue that in-context learning (ICL) predictably arises from standard self-supervised next-token pretraining, rather than being an exotic emergent property. This work establishes the foundational principles of this emergence by focusing on in-distribution ICL, demonstrating how models necessarily adapt to context when trained on token sequences, especially from non-ergodic sources. Our information-theoretic framework precisely predicts these in-distribution ICL dynamics (i.e., context-dependent loss reduction). We verify this with experiments using synthetic datasets of differing types of correlational structure, reproducing characteristic phenomena like phase transitions in training loss for induction head formation and power-law scaling of in-context loss. We further show that a model's in-context performance on any task is mathematically coupled to the ensemble of tasks seen in pretraining, offering a fundamental explanation, grounded in architecture- and modality-independent principles, for such inference-time learning.

摘要

我们提出，上下文学习（ICL）可预测地源自标准的自监督下一词元预训练，而非一种特殊的涌现属性。本研究通过聚焦同分布ICL现象，阐明了这种涌现的基本原理：当模型在非遍历性数据源的词元序列上训练时，必然发展出适应上下文的能力。我们的信息论框架精确预测了这些同分布ICL动态（即上下文依赖的损失降低）。通过在不同相关结构的合成数据集上进行实验，我们验证了该框架的有效性，复现了训练损失中的特征现象——如归纳头形成时的相变和上下文损失的幂律缩放。进一步研究表明，模型在任何任务上的上下文表现都与预训练中接触的任务集合存在数学耦合，这为这种推理时学习提供了基于架构与模态无关原理的根本性解释。

LatentLLM: Attention-Aware Joint Tensor Compression

Abstract

arXiv:2505.18413v1 Announce Type: cross Abstract: Modern foundation models such as large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure. Our method extends a local activation-aware tensor decomposition to a global attention-aware joint tensor decomposition. Our framework can significantly improve the model accuracy over the existing model compression methods when reducing the latent dimension to realize computationally/memory-efficient LLMs/LMMs. We show the benefit on several benchmarks, including multi-modal reasoning tasks.

摘要

现代基础模型（如大语言模型LLMs和大规模多模态模型LMMs）需要消耗巨大的计算和内存资源。我们提出了一种新框架，可将此类LLMs/LMMs转换为降维潜在结构。该方法将局部激活感知的张量分解扩展为全局注意力感知的联合张量分解。当降低潜在维度以实现计算/内存高效的LLMs/LMMs时，我们的框架能显著提升现有模型压缩方法的精度。我们在包括多模态推理任务在内的多个基准测试中验证了该方法的优势。

Thought calibration: Efficient and confident test-time scaling

Abstract

arXiv:2505.18404v1 Announce Type: cross Abstract: Reasoning large language models achieve impressive test-time scaling by thinking for longer, but this performance gain comes at significant compute cost. Directly limiting test-time budget hurts overall performance, but not all problems are equally difficult. We propose thought calibration to decide dynamically when thinking can be terminated. To calibrate our decision rule, we view a language model's growing body of thoughts as a nested sequence of reasoning trees, where the goal is to identify the point at which novel reasoning plateaus. We realize this framework through lightweight probes that operate on top of the language model's hidden representations, which are informative of both the reasoning structure and overall consistency of response. Based on three reasoning language models and four datasets, thought calibration preserves model performance with up to a 60% reduction in thinking tokens on in-distribution data, and up to 20% in out-of-distribution data.

摘要

大型语言模型通过延长推理时间实现了显著的测试时性能提升，但这种性能增益伴随着高昂的计算成本。直接限制测试时预算会损害整体性能，但并非所有问题都具有同等难度。我们提出思维校准方法，用于动态决定何时终止推理过程。为校准决策规则，我们将语言模型不断增长的思维体视为嵌套的推理树序列，其目标是识别新推理达到平台期的临界点。该框架通过轻量级探针实现，这些探针作用于语言模型的隐藏表示层，既能捕捉推理结构信息，又能评估响应整体一致性。基于三个推理语言模型和四个数据集的实验表明，思维校准在分布内数据上可减少高达60%的推理标记消耗，在分布外数据上减少达20%，同时保持模型性能。

$\mu$-MoE: Test-Time Pruning as Micro-Grained Mixture-of-Experts

Abstract

arXiv:2505.18451v1 Announce Type: cross Abstract: To tackle the huge computational demand of large foundation models, activation-aware compression techniques without retraining have been introduced. However, since these rely on calibration data, domain shift may arise for unknown downstream tasks. With computationally efficient calibration, activation-aware pruning can be executed adaptively for every prompt while still reducing complexity at inference. We formulate it as a mixture of micro-experts, called $\mu$-MoE. Several experiments demonstrate that $\mu$-MoE can dynamically adapt to task/prompt-dependent structured sparsity on the fly.

摘要

为应对大型基础模型巨大的计算需求，无需重新训练的激活感知压缩技术应运而生。然而，由于这些技术依赖校准数据，在未知下游任务中可能出现域偏移问题。通过计算高效的校准过程，我们实现了针对每个提示的自适应激活感知剪枝，同时降低了推理复杂度。我们将其建模为一种微型专家混合系统（μ-MoE）。多项实验表明，μ-MoE能够实时动态适应任务/提示相关的结构化稀疏性。

Retrieval Augmented Generation-based Large Language Models for Bridging Transportation Cybersecurity Legal Knowledge Gaps

Abstract

arXiv:2505.18426v1 Announce Type: cross Abstract: As connected and automated transportation systems evolve, there is a growing need for federal and state authorities to revise existing laws and develop new statutes to address emerging cybersecurity and data privacy challenges. This study introduces a Retrieval-Augmented Generation (RAG) based Large Language Model (LLM) framework designed to support policymakers by extracting relevant legal content and generating accurate, inquiry-specific responses. The framework focuses on reducing hallucinations in LLMs by using a curated set of domain-specific questions to guide response generation. By incorporating retrieval mechanisms, the system enhances the factual grounding and specificity of its outputs. Our analysis shows that the proposed RAG-based LLM outperforms leading commercial LLMs across four evaluation metrics: AlignScore, ParaScore, BERTScore, and ROUGE, demonstrating its effectiveness in producing reliable and context-aware legal insights. This approach offers a scalable, AI-driven method for legislative analysis, supporting efforts to update legal frameworks in line with advancements in transportation technologies.

摘要

随着互联与自动化交通系统的发展，联邦和州级监管机构亟需修订现有法律并制定新法规以应对新兴的网络安全和数据隐私挑战。本研究提出了一种基于检索增强生成（RAG）的大语言模型（LLM）框架，旨在通过提取相关法律内容并生成精准的查询响应来支持政策制定者。该框架通过使用特定领域问题集指导响应生成，有效减少大语言模型的幻觉现象。通过整合检索机制，系统显著增强了输出结果的事实依据与针对性。分析表明，基于RAG的大语言模型在AlignScore、ParaScore、BERTScore和ROUGE四项评估指标上均优于主流商用大语言模型，证实其在生成可靠且情境感知的法律见解方面的有效性。该方法为立法分析提供了可扩展的人工智能驱动解决方案，支持交通技术发展背景下的法律框架更新工作。

TNG-CLIP: Training-Time Negation Data Generation for Negation Awareness of CLIP

Abstract

arXiv:2505.18434v1 Announce Type: cross Abstract: Vision-language models (VLMs), such as CLIP, have demonstrated strong performance across a range of downstream tasks. However, CLIP is still limited in negation understanding: the ability to recognize the absence or exclusion of a concept. Existing methods address the problem by using a large language model (LLM) to generate large-scale data of image captions containing negation for further fine-tuning CLIP. However, these methods are both time- and compute-intensive, and their evaluations are typically restricted to image-text matching tasks. To expand the horizon, we (1) introduce a training-time negation data generation pipeline in which negation captions are generated during the training stage, adding only 2.5% extra training time, and (2) propose the first benchmark, Neg-TtoI, for evaluating text-to-image generation models on prompts containing negation, assessing a model's ability to produce semantically accurate images. We show that our proposed method, TNG-CLIP, achieves SOTA performance on diverse negation benchmarks of image-to-text matching, text-to-image retrieval, and image generation.

摘要

视觉语言模型（VLM，如CLIP）在一系列下游任务中展现出强劲性能。然而，CLIP在否定理解（即识别概念缺失或排除的能力）方面仍存在局限。现有方法通过使用大语言模型（LLM）生成包含否定的大规模图像描述数据以微调CLIP，但这类方法耗时且计算密集，且评估通常仅限于图文匹配任务。为拓展研究边界，我们（1）提出一种训练时否定数据生成流程，使否定描述在训练阶段动态生成，仅增加2.5%的额外训练时间；（2）首次建立Neg-TtoI基准，用于评估文本到图像生成模型处理含否定提示时的语义准确性。实验表明，我们提出的TNG-CLIP方法在图文匹配、文本到图像检索及图像生成等多类否定基准测试中均达到最先进性能。

Efficient Long CoT Reasoning in Small Language Models

Abstract

arXiv:2505.18440v1 Announce Type: cross Abstract: Recent large reasoning models such as DeepSeek-R1 exhibit strong complex problem-solving abilities by generating long chain-of-thought (CoT) reasoning steps. It is challenging to directly train small language models (SLMs) to produce long CoT. Thus, distillation becomes a practical method to give SLMs such reasoning ability. However, long CoT often contains a lot of redundant content (e.g., overthinking steps), which may make it hard for SLMs to learn given their relatively limited capacity and generalization. To address this issue, we propose a simple yet effective method to prune unnecessary steps in long CoT, and then employ an on-policy method for the SLM itself to curate valid and useful long CoT training data. In this way, SLMs can effectively learn efficient long CoT reasoning and preserve competitive performance at the same time. Experimental results across a series of mathematical reasoning benchmarks demonstrate the effectiveness of the proposed method in distilling long CoT reasoning ability into SLMs, maintaining competitive performance while significantly reducing the generation of redundant reasoning steps.

摘要

近期诸如DeepSeek-R1等大型推理模型通过生成长链思维（CoT）推理步骤展现出强大的复杂问题解决能力。直接训练小型语言模型（SLMs）实现长链思维涌现具有挑战性，因此蒸馏成为赋予SLMs此类推理能力的实用方法。然而，长链思维常包含大量冗余内容（如过度思考步骤），考虑到SLMs相对有限的容量和泛化能力，这可能使其难以有效学习。针对该问题，我们提出一种简单而有效的方法来修剪长链思维中不必要的步骤，并采用策略内方法让SLM自身筛选有效且有用的长链思维训练数据。通过这种方式，SLMs既能高效学习长链思维推理，又能保持竞争优势。在一系列数学推理基准测试中的实验结果表明，所提方法能有效将长链思维推理能力蒸馏至SLMs，在保持性能竞争力的同时显著减少冗余推理步骤的生成。

Synthesizing and Adapting Error Correction Data for Mobile Large Language Model Applications

Abstract

arXiv:2505.18488v1 Announce Type: cross Abstract: Error correction is an important capability when applying large language models (LLMs) to facilitate user typing on mobile devices. In this paper, we use LLMs to synthesize a high-quality dataset of error correction pairs to evaluate and improve LLMs for mobile applications. We first prompt LLMs with error correction domain knowledge to build a scalable and reliable addition to the existing data synthesis pipeline. We then adapt the synthetic data distribution to match the mobile application domain by reweighting the samples. The reweighting model is learnt by predicting (a handful of) live A/B test metrics when deploying LLMs in production, given the LLM performance on offline evaluation data and scores from a small privacy-preserving on-device language model. Finally, we present best practices for mixing our synthetic data with other data sources to improve model performance on error correction in both offline evaluation and production live A/B testing.

摘要

在将大语言模型（LLMs）应用于移动设备用户输入辅助时，纠错能力至关重要。本文利用LLMs合成高质量纠错配对数据集，以评估并优化移动应用场景下的语言模型性能。我们首先基于纠错领域知识设计提示方案，构建可扩展且可靠的数据合成流程扩展。随后通过样本重加权方法，使合成数据分布适配移动应用领域特性。该重加权模型通过预测生产环境中的少量A/B测试指标进行训练，其输入包括LLMs在离线评估数据上的表现以及小型隐私保护设备端语言模型的评分结果。最后，我们提出混合使用合成数据与其他数据源的最佳实践方案，以提升模型在离线评估和生产环境A/B测试中的纠错性能。

Using Large Language Models to Tackle Fundamental Challenges in Graph Learning: A Comprehensive Survey

Abstract

arXiv:2505.18475v1 Announce Type: cross Abstract: Graphs are a widely used paradigm for representing non-Euclidean data, with applications ranging from social network analysis to biomolecular prediction. Conventional graph learning approaches typically rely on fixed structural assumptions or fully observed data, limiting their effectiveness in more complex, noisy, or evolving settings. Consequently, real-world graph data often violates the assumptions of traditional graph learning methods, in particular, it leads to four fundamental challenges: (1) Incompleteness, real-world graphs have missing nodes, edges, or attributes; (2) Imbalance, the distribution of the labels of nodes or edges and their structures for real-world graphs are highly skewed; (3) Cross-domain Heterogeneity, graphs from different domains exhibit incompatible feature spaces or structural patterns; and (4) Dynamic Instability, graphs evolve over time in unpredictable ways. Recent advances in Large Language Models (LLMs) offer the potential to tackle these challenges by leveraging rich semantic reasoning and external knowledge. This survey provides a comprehensive review of how LLMs can be integrated with graph learning to address the aforementioned challenges. For each challenge, we review both traditional solutions and modern LLM-driven approaches, highlighting how LLMs contribute unique advantages. Finally, we discuss open research questions and promising future directions in this emerging interdisciplinary field. To support further exploration, we have curated a repository of recent advances on graph learning challenges: https://github.com/limengran98/Awesome-Literature-Graph-Learning-Challenges.

Abstract

arXiv:2505.18464v1 Announce Type: cross Abstract: The growing demand for accessible mental health support, compounded by workforce shortages and logistical barriers, has led to increased interest in utilizing Large Language Models (LLMs) for scalable and real-time assistance. However, their use in sensitive domains such as anxiety support remains underexamined. This study presents a systematic evaluation of LLMs (GPT and Llama) for their potential utility in anxiety support by using real user-generated posts from the r/Anxiety subreddit for both prompting and fine-tuning. Our approach utilizes a mixed-method evaluation framework incorporating three main categories of criteria: (i) linguistic quality, (ii) safety and trustworthiness, and (iii) supportiveness. Results show that fine-tuning LLMs with naturalistic anxiety-related data enhanced linguistic quality but increased toxicity and bias, and diminished emotional responsiveness. While LLMs exhibited limited empathy, GPT was evaluated as more supportive overall. Our findings highlight the risks of fine-tuning LLMs on unprocessed social media content without mitigation strategies.

摘要

随着对便捷心理健康服务需求的日益增长，加之专业人才短缺和地理障碍等因素，利用大语言模型（LLMs）提供可扩展的实时辅助服务受到广泛关注。然而，这类模型在焦虑支持等敏感领域的应用仍缺乏系统研究。本研究通过使用r/Anxiety版块真实用户发帖作为提示词和微调数据，对GPT和Llama等大语言模型在焦虑支持中的潜在效用进行系统评估。我们采用混合方法评估框架，包含三个主要标准类别：（1）语言质量；（2）安全性与可信度；（3）支持性。结果表明，基于自然焦虑数据微调的模型虽提升了语言质量，但毒性偏见增加、情感响应性降低。大语言模型整体表现出有限共情能力，其中GPT被评估为更具支持性。本研究揭示了在缺乏缓解策略情况下，直接使用未经处理的社交媒体内容微调大语言模型的风险。

Invisible Tokens, Visible Bills: The Urgent Need to Audit Hidden Operations in Opaque LLM Services

Abstract

arXiv:2505.18471v1 Announce Type: cross Abstract: Modern large language model (LLM) services increasingly rely on complex, often abstract operations, such as multi-step reasoning and multi-agent collaboration, to generate high-quality outputs. While users are billed based on token consumption and API usage, these internal steps are typically not visible. We refer to such systems as Commercial Opaque LLM Services (COLS). This position paper highlights emerging accountability challenges in COLS: users are billed for operations they cannot observe, verify, or contest. We formalize two key risks: quantity inflation, where token and call counts may be artificially inflated, and quality downgrade, where providers might quietly substitute lower-cost models or tools. Addressing these risks requires a diverse set of auditing strategies, including commitment-based, predictive, behavioral, and signature-based methods. We further explore the potential of complementary mechanisms such as watermarking and trusted execution environments to enhance verifiability without compromising provider confidentiality. We also propose a modular three-layer auditing framework for COLS and users that enables trustworthy verification across execution, secure logging, and user-facing auditability without exposing proprietary internals. Our aim is to encourage further research and policy development toward transparency, auditability, and accountability in commercial LLM services.

摘要

现代大型语言模型（LLM）服务日益依赖复杂且通常抽象的操作（如多步推理与多智能体协作）来生成高质量输出。尽管用户计费基于令牌消耗和API使用量，但这些内部步骤通常不可见。我们将此类系统称为商业不透明LLM服务（COLS）。本立场文件揭示了COLS中新兴的问责挑战：用户为无法观察、验证或质疑的操作付费。我们形式化了两大风险：数量膨胀（令牌和调用计数可能被人为夸大）与质量降级（提供商可能悄然替换低成本模型或工具）。应对这些风险需要多样化的审计策略，包括基于承诺、预测、行为及签名的方法。我们进一步探讨了水印与可信执行环境等补充机制在提升可验证性同时不损害提供商机密性的潜力。此外，我们提出了面向COLS与用户的模块化三层审计框架，该框架支持跨执行、安全日志记录和用户可审计性的可信验证，且无需暴露专有内部信息。本研究旨在推动商业LLM服务在透明度、可审计性与问责制方面的进一步研究与政策制定。
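One way to picture the commitment-based auditing mentioned in this abstract is a hash commitment: the provider publishes a digest of its hidden operation log at billing time and reveals the log only to an auditor. The sketch below is purely illustrative. The log format, the `commit` and `verify` helpers, and the choice of SHA-256 with a random nonce are assumptions, not the protocol proposed in the paper.

```python
import hashlib
import json
import secrets


def commit(hidden_log, nonce=None):
    """Return (digest, nonce) binding the provider to a hidden operation log."""
    nonce = nonce if nonce is not None else secrets.token_hex(16)
    payload = json.dumps(hidden_log, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(nonce.encode("utf-8") + payload).hexdigest()
    return digest, nonce


def verify(digest, nonce, revealed_log, billed_tokens):
    """Check that the revealed log matches the commitment and the bill."""
    recomputed, _ = commit(revealed_log, nonce)
    total = sum(step["tokens"] for step in revealed_log)
    return recomputed == digest and total == billed_tokens


# Hypothetical hidden steps behind one user-visible answer.
log = [{"step": "plan", "tokens": 120}, {"step": "tool_call", "tokens": 45}]
digest, nonce = commit(log)
```

Under these assumptions, `verify(digest, nonce, log, 165)` succeeds, while a bill for a different token count or a log edited after the fact fails the check. A commitment of this kind only shows that the log was not changed afterwards; it does not show that the log was honest in the first place.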

AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking

Abstract

arXiv:2505.18512v1 Announce Type: cross Abstract: Listwise reranking with large language models (LLMs) enhances top-ranked results in retrieval-based applications. Due to the limit in context size and high inference cost of long context, reranking is typically performed over a fixed size of small subsets, with the final ranking aggregated from these partial results. This fixed computation disregards query difficulty and document distribution, leading to inefficiencies. We propose AcuRank, an adaptive reranking framework that dynamically adjusts both the amount and target of computation based on uncertainty estimates over document relevance. Using a Bayesian TrueSkill model, we iteratively refine relevance estimates until reaching sufficient confidence levels, and our explicit modeling of ranking uncertainty enables principled control over reranking behavior and avoids unnecessary updates to confident predictions. Results on the TREC-DL and BEIR benchmarks show that our method consistently achieves a superior accuracy-efficiency trade-off and scales better with compute than fixed-computation baselines. These results highlight the effectiveness and generalizability of our method across diverse retrieval tasks and LLM-based reranking models.

摘要

基于大语言模型（LLMs）的列表式重排序能够提升检索应用中排名靠前的结果质量。由于上下文长度限制及长上下文推理成本较高，重排序通常仅针对固定数量的小规模候选子集进行，最终排序结果由这些局部结果聚合而成。这种固定计算模式忽视了查询难度与文档分布特性，导致效率低下。我们提出AcuRank——一种自适应重排序框架，通过基于文档相关性不确定性估计的动态机制，自适应调整计算量与计算目标。该方法采用贝叶斯TrueSkill模型迭代优化相关性估计直至达到足够置信度，其显式的排序不确定性建模实现了对重排序行为的可控调节，避免对高置信度预测进行不必要的更新。在TREC-DL和BEIR基准测试上的实验表明，本方法始终能实现更优的准确率-效率权衡，且计算扩展性优于固定计算基线。这些结果验证了我们的方法在不同检索任务和基于LLM的重排序模型中具有显著的有效性与泛化能力。
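The idea of spending reranking compute only where relevance is uncertain can be sketched with Gaussian beliefs over document scores. This is a simplified illustration and not AcuRank itself: it omits the TrueSkill update, and the `uncertain_docs` helper, the midpoint threshold, and the `eps` cutoff are all assumptions made for the example. It also assumes every `sigma` is positive.

```python
import math


def prob_above(mu, sigma, threshold):
    """P(score > threshold) for a Gaussian belief N(mu, sigma^2)."""
    z = (mu - threshold) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def uncertain_docs(beliefs, k, eps=0.05):
    """Return ids of documents whose top-k membership is still unclear.

    beliefs maps doc id -> (mu, sigma). The threshold is the midpoint between
    the k-th and (k+1)-th highest means (an assumption of this sketch).
    """
    ranked = sorted(beliefs, key=lambda d: beliefs[d][0], reverse=True)
    if len(ranked) <= k:
        return []
    threshold = 0.5 * (beliefs[ranked[k - 1]][0] + beliefs[ranked[k]][0])
    selected = []
    for doc in ranked:
        mu, sigma = beliefs[doc]
        p = prob_above(mu, sigma, threshold)
        if eps < p < 1.0 - eps:
            selected.append(doc)
    return selected


beliefs = {"d1": (30.0, 1.0), "d2": (25.0, 4.0), "d3": (24.0, 4.0), "d4": (10.0, 1.0)}
to_rerank = uncertain_docs(beliefs, k=2)
```

Here only `d2` and `d3` straddle the top-2 boundary, so only they would be sent back to the LLM reranker. An adaptive loop would update the beliefs from each new partial ranking and stop once the function returns an empty list.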

From Word to World: Evaluate and Mitigate Culture Bias via Word Association Test

Abstract

arXiv:2505.18562v1 Announce Type: cross Abstract: The human-centered word association test (WAT) serves as a cognitive proxy, revealing sociocultural variations through lexical-semantic patterns. We extend this test into an LLM-adaptive, free-relation task to assess the alignment of large language models (LLMs) with cross-cultural cognition. To mitigate the culture preference, we propose CultureSteer, an innovative approach that integrates a culture-aware steering mechanism to guide semantic representations toward culturally specific spaces. Experiments show that current LLMs exhibit significant bias toward Western cultural schemas (notably American ones) at the word association level. In contrast, our model substantially improves cross-cultural alignment, surpassing prompt-based methods in capturing diverse semantic associations. Further validation on culture-sensitive downstream tasks confirms its efficacy in fostering cognitive alignment across cultures. This work contributes a novel methodological paradigm for enhancing cultural awareness in LLMs, advancing the development of more inclusive language technologies.

摘要

以人为中心的词汇联想测试（WAT）作为认知代理，通过词汇语义模式揭示社会文化差异。本研究将该测试扩展为适应大语言模型（LLM）的自由联想任务，用于评估大语言模型与跨文化认知的契合度。为消除文化偏好，我们提出CultureSteer创新方法，通过集成文化感知引导机制，将语义表征导向特定文化空间。实验表明，当前大语言模型在词汇联想层面显著偏向西方文化（尤其是美国）图式；相较之下，我们的模型显著提升了跨文化契合度，在捕捉多样化语义关联方面超越基于提示词的方法。在文化敏感性下游任务中的进一步验证证实了该方法在促进跨文化认知对齐方面的有效性。本研究为增强大语言模型的文化意识提供了新颖的方法论范式，推动了更具包容性语言技术的发展。

G1: Teaching LLMs to Reason on Graphs with Reinforcement Learning

Abstract

arXiv:2505.18499v1 Announce Type: cross Abstract: Although Large Language Models (LLMs) have demonstrated remarkable progress, their proficiency in graph-related tasks remains notably limited, hindering the development of truly general-purpose models. Previous attempts, including pretraining graph foundation models or employing supervised fine-tuning, often face challenges such as the scarcity of large-scale, universally represented graph data. We introduce G1, a simple yet effective approach demonstrating that Reinforcement Learning (RL) on synthetic graph-theoretic tasks can significantly scale LLMs' graph reasoning abilities. To enable RL training, we curate Erdős, the largest graph reasoning dataset to date, comprising 50 diverse graph-theoretic tasks of varying difficulty levels, 100k training data and 5k test data, all derived from real-world graphs. With RL on Erdős, G1 obtains substantial improvements in graph reasoning, where our finetuned 3B model even outperforms Qwen2.5-72B-Instruct (24x size). RL-trained models also show strong zero-shot generalization to unseen tasks, domains, and graph encoding schemes, including other graph-theoretic benchmarks as well as real-world node classification and link prediction tasks, without compromising general reasoning abilities. Our findings offer an efficient, scalable path for building strong graph reasoners by finetuning LLMs with RL on graph-theoretic tasks, which combines the strengths of pretrained LLM capabilities with abundant, automatically generated synthetic data, suggesting that LLMs possess graph understanding abilities that RL can elicit successfully.

摘要

尽管大型语言模型（LLMs）已展现出显著进展，但其在图相关任务中的表现仍存在明显局限，这阻碍了通用模型的真正发展。先前尝试（包括预训练图基础模型或采用监督微调）常面临大规模通用图数据稀缺等挑战。我们提出G1，一种简单而有效的方法，证明在合成图论任务上通过强化学习（RL）可显著扩展LLMs的图推理能力。为支持RL训练，我们构建了迄今最大规模的图推理数据集Erdős，包含50种不同难度的多样化图论任务、10万训练数据和5千测试数据，所有数据均源自真实世界图结构。通过在Erdős上进行RL训练，G1实现了图推理能力的显著提升：经微调的30亿参数模型甚至超越Qwen2.5-72B-Instruct（规模为其24倍）。RL训练模型还展现出对未见任务、领域及图编码方案的强大零样本泛化能力，包括其他图论基准测试以及真实世界的节点分类和链接预测任务，且不影响通用推理能力。我们的研究为构建强图推理器提供了一条高效、可扩展的路径：通过在图论任务上对LLMs进行RL微调，将预训练LLM能力与自动生成的丰富合成数据优势相结合，这表明LLMs具备可通过RL成功激发的图理解能力。

FedHL: Federated Learning for Heterogeneous Low-Rank Adaptation via Unbiased Aggregation

Abstract

arXiv:2505.18494v1 Announce Type: cross Abstract: Federated Learning (FL) facilitates the fine-tuning of Foundation Models (FMs) using distributed data sources, with Low-Rank Adaptation (LoRA) gaining popularity due to its low communication costs and strong performance. While recent work acknowledges the benefits of heterogeneous LoRA in FL and introduces flexible algorithms to support its implementation, our theoretical analysis reveals a critical gap: existing methods lack formal convergence guarantees due to parameter truncation and biased gradient updates. Specifically, adapting client-specific LoRA ranks necessitates truncating global parameters, which introduces inherent truncation errors and leads to subsequent inaccurate gradient updates that accumulate over training rounds, ultimately degrading performance. To address the above issues, we propose FedHL, a simple yet effective Federated Learning framework tailored for Heterogeneous LoRA. By leveraging the full-rank global model as a calibrated aggregation basis, FedHL eliminates the direct truncation bias from initial alignment with client-specific ranks. Furthermore, we derive the theoretically optimal aggregation weights by minimizing the gradient drift term in the convergence upper bound. Our analysis shows that FedHL guarantees $\mathcal{O}(1/\sqrt{T})$ convergence rate, and experiments on multiple real-world datasets demonstrate a 1-3% improvement over several state-of-the-art methods.

摘要

联邦学习（FL）支持利用分布式数据源对基础模型（FM）进行微调，其中低秩自适应（LoRA）因其低通信成本和优异性能而广受关注。尽管近期研究认识到异构LoRA在FL中的优势，并提出了灵活算法支持其实现，但我们的理论分析揭示了一个关键缺陷：现有方法由于参数截断和梯度更新偏差而缺乏形式化收敛保证。具体而言，为适应客户端特定的LoRA秩，需对全局参数进行截断，这会引入固有截断误差，并导致后续梯度更新不准确，这些误差在训练轮次中不断累积，最终降低模型性能。为解决上述问题，我们提出FedHL，一个简单而有效的、专为异构LoRA设计的联邦学习框架。该方法通过将全秩全局模型作为校准聚合基准，消除了与客户端特定秩初始对齐时的直接截断偏差。此外，我们通过最小化收敛上界中的梯度漂移项，推导出理论最优聚合权重。分析表明FedHL可保证 $\mathcal{O}(1/\sqrt{T})$ 的收敛速率，在多个真实数据集上的实验显示其性能较现有最优方法提升1-3%。
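The truncation error this abstract refers to can be seen in a toy example. A LoRA update is a product of a tall matrix B and a wide matrix A, which equals a sum of rank-one terms; handing a lower-rank client only the first few terms leaves a gap to the full update. The sketch below assumes truncation means keeping the leading rank components, and the helper names and numbers are made up for illustration. It does not implement FedHL's aggregation.

```python
def lora_delta(B, A, rank):
    """Compute sum_{i < rank} B[:, i] * A[i, :] for list-of-lists matrices."""
    rows, cols = len(B), len(A[0])
    delta = [[0.0] * cols for _ in range(rows)]
    for i in range(rank):
        for r in range(rows):
            for c in range(cols):
                delta[r][c] += B[r][i] * A[i][c]
    return delta


def frobenius_gap(X, Y):
    """Frobenius norm of X - Y."""
    return sum((x - y) ** 2 for rx, ry in zip(X, Y) for x, y in zip(rx, ry)) ** 0.5


# Global adapter of rank 2 on a 2x2 weight (toy numbers).
B = [[1.0, 0.0], [0.0, 2.0]]  # shape (2, r=2)
A = [[1.0, 0.0], [0.0, 1.0]]  # shape (r=2, 2)

full = lora_delta(B, A, rank=2)    # update held by the server
client = lora_delta(B, A, rank=1)  # update seen by a rank-1 client
gap = frobenius_gap(full, client)
```

In this toy case the rank-1 client misses the second component entirely, so `gap` is 2.0. Gradients computed on the truncated update are taken at a different point than the server's model, which is the source of bias the abstract describes.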

CLaDMoP: Learning Transferrable Models from Successful Clinical Trials via LLMs

Abstract

arXiv:2505.18527v1 Announce Type: cross Abstract: Many existing models for clinical trial outcome prediction are optimized using task-specific loss functions on trial phase-specific data. While this scheme may boost prediction for common diseases and drugs, it can hinder learning of generalizable representations, leading to more false positives/negatives. To address this limitation, we introduce CLaDMoP, a new pre-training approach for clinical trial outcome prediction, alongside the Successful Clinical Trials (SCT) dataset, specifically designed for this task. CLaDMoP leverages a Large Language Model, used to encode trials' eligibility criteria, linked to a lightweight Drug-Molecule branch through a novel multi-level fusion technique. To efficiently fuse long embeddings across levels, we incorporate a grouping block, drastically reducing computational overhead. CLaDMoP avoids reliance on task-specific objectives by pre-training on a "pair matching" proxy task. Compared to established zero-shot and few-shot baselines, our method significantly improves both PR-AUC and ROC-AUC, especially for phase I and phase II trials. We further evaluate and perform ablation on CLaDMoP after Parameter-Efficient Fine-Tuning, comparing it to state-of-the-art supervised baselines, including MEXA-CTP, on the Trial Outcome Prediction (TOP) benchmark. CLaDMoP achieves up to 10.5% improvement in PR-AUC and 3.6% in ROC-AUC, while attaining comparable F1 score to MEXA-CTP, highlighting its potential for clinical trial outcome prediction. Code and SCT dataset can be downloaded from https://github.com/murai-lab/CLaDMoP.

摘要

现有许多临床试验结果预测模型通过在特定试验阶段数据上使用任务专用损失函数进行优化。尽管这种方案可能提升常见疾病和药物的预测效果，但会阻碍可泛化表征的学习，导致更多假阳性/假阴性结果。为克服这一局限，我们提出CLaDMoP——一种新的临床试验结果预训练方法，并为此专门构建了成功临床试验数据集(SCT)。CLaDMoP利用大型语言模型编码试验的入选标准，通过新型多层次融合技术将其与轻量级药物分子分支相连接。为高效融合跨层次的长嵌入向量，我们引入分组模块，显著降低计算开销。该方法通过"配对匹配"代理任务进行预训练，避免依赖任务特定目标。相较于成熟的零样本和小样本基线，我们的方法在PR-AUC和ROC-AUC指标上均取得显著提升，尤其在一期和二期临床试验中表现突出。我们在参数高效微调后对CLaDMoP进行评估和消融实验，与包括MEXA-CTP在内的监督学习基线在试验结果预测(TOP)基准上进行比较。CLaDMoP实现PR-AUC最高提升10.5%、ROC-AUC提升3.6%，同时获得与MEXA-CTP相当的F1分数，展现了其在临床试验结果预测中的应用潜力。代码和SCT数据集可从https://github.com/murai-lab/CLaDMoP下载。

Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models

Abstract

arXiv:2505.18536v1 Announce Type: cross Abstract: Standing in 2025, at a critical juncture in the pursuit of Artificial General Intelligence (AGI), reinforcement fine-tuning (RFT) has demonstrated significant potential in enhancing the reasoning capability of large language models (LLMs) and has led to the development of cutting-edge AI models such as OpenAI-o1 and DeepSeek-R1. Moreover, the efficient application of RFT to enhance the reasoning capability of multimodal large language models (MLLMs) has attracted widespread attention from the community. In this position paper, we argue that reinforcement fine-tuning powers the reasoning capability of multimodal large language models. To begin with, we provide a detailed introduction to the fundamental background knowledge that researchers interested in this field should be familiar with. Furthermore, we meticulously summarize the improvements of RFT in powering reasoning capability of MLLMs into five key points: diverse modalities, diverse tasks and domains, better training algorithms, abundant benchmarks and thriving engineering frameworks. Finally, we propose five promising directions for future research that the community might consider. We hope that this position paper will provide valuable insights to the community at this pivotal stage in the advancement toward AGI. Summary of works done on RFT for MLLMs is available at https://github.com/Sun-Haoyuan23/Awesome-RL-based-Reasoning-MLLMs.

摘要

站在2025年这一追求通用人工智能（AGI）的关键节点，强化微调（RFT）技术在提升大语言模型（LLMs）推理能力方面已展现出显著潜力，并催生了OpenAI-o1与DeepSeek-R1等尖端AI模型。更值得注意的是，RFT在增强多模态大语言模型（MLLMs）推理能力方面的有效应用已引发学界广泛关注。本立场文件论证了强化微调技术对多模态大语言模型推理能力的赋能作用。首先，我们系统介绍了该领域研究者应掌握的基础背景知识；进而将RFT提升MLLMs推理能力的进展精炼为五大要点：多模态支持、多任务与多领域适应、优化训练算法、丰富基准测试体系及蓬勃发展的工程框架；最后提出了五个值得学界探索的未来研究方向。我们期待这份立场文件能为AGI发展关键阶段的学术共同体提供有价值的洞见。RFT应用于MLLMs的研究成果汇总详见https://github.com/Sun-Haoyuan23/Awesome-RL-based-Reasoning-MLLMs。

Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation

Abstract

arXiv:2505.18556v1 Announce Type: cross Abstract: Intent detection, a core component of natural language understanding, has considerably evolved as a crucial mechanism in safeguarding large language models (LLMs). While prior work has applied intent detection to enhance LLMs' moderation guardrails, showing a significant success against content-level jailbreaks, the robustness of these intent-aware guardrails under malicious manipulations remains under-explored. In this work, we investigate the vulnerability of intent-aware guardrails and demonstrate that LLMs exhibit implicit intent detection capabilities. We propose a two-stage intent-based prompt-refinement framework, IntentPrompt, that first transforms harmful inquiries into structured outlines and further reframes them into declarative-style narratives by iteratively optimizing prompts via feedback loops to enhance jailbreak success for red-teaming purposes. Extensive experiments across four public benchmarks and various black-box LLMs indicate that our framework consistently outperforms several cutting-edge jailbreak methods and evades even advanced Intent Analysis (IA) and Chain-of-Thought (CoT)-based defenses. Specifically, our "FSTR+SPIN" variant achieves attack success rates ranging from 88.25% to 96.54% against CoT-based defenses on the o1 model, and from 86.75% to 97.12% on the GPT-4o model under IA-based defenses. These findings highlight a critical weakness in LLMs' safety mechanisms and suggest that intent manipulation poses a growing challenge to content moderation guardrails.

摘要

意图检测作为自然语言理解的核心组件，已发展成为保护大语言模型（LLMs）安全的关键机制。尽管先前研究已应用意图检测来强化LLMs的内容审核护栏，并在防御内容层面越狱攻击方面取得显著成效，但这些意图感知护栏在恶意操纵下的鲁棒性仍未得到充分探索。本研究揭示了意图感知护栏的脆弱性，并证明LLMs具有隐式意图检测能力。我们提出了一种两阶段基于意图的提示优化框架IntentPrompt：首先将有害查询转化为结构化纲要，继而通过反馈循环迭代优化提示，将其重构为陈述式叙述以提升红队测试的越狱成功率。在四个公开基准测试和多种黑盒LLMs上的大量实验表明，本框架持续优于多种前沿越狱方法，并能规避包括意图分析（IA）和思维链（CoT）在内的先进防御机制。具体而言，我们的"FSTR+SPIN"变体在o1模型上对CoT防御的攻击成功率达88.25%至96.54%，在GPT-4o模型上对IA防御的攻击成功率达86.75%至97.12%。这些发现揭示了LLMs安全机制的关键弱点，表明意图操纵对内容审核护栏构成日益严峻的挑战。

Removal of Hallucination on Hallucination: Debate-Augmented RAG

Abstract

arXiv:2505.18581v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) enhances factual accuracy by integrating external knowledge, yet it introduces a critical issue: erroneous or biased retrieval can mislead generation, compounding hallucinations, a phenomenon we term Hallucination on Hallucination. To address this, we propose Debate-Augmented RAG (DRAG), a training-free framework that integrates Multi-Agent Debate (MAD) mechanisms into both retrieval and generation stages. In retrieval, DRAG employs structured debates among proponents, opponents, and judges to refine retrieval quality and ensure factual reliability. In generation, DRAG introduces asymmetric information roles and adversarial debates, enhancing reasoning robustness and mitigating factual inconsistencies. Evaluations across multiple tasks demonstrate that DRAG improves retrieval reliability, reduces RAG-induced hallucinations, and significantly enhances overall factual accuracy. Our code is available at https://github.com/Huenao/Debate-Augmented-RAG.

摘要

检索增强生成（RAG）通过整合外部知识提升事实准确性，但引入了一个关键问题：错误或有偏见的检索可能误导生成过程，加剧幻觉现象，我们称之为"幻觉叠加"。为解决这一问题，我们提出辩论增强RAG（DRAG），这是一种无需训练的框架，将多智能体辩论（MAD）机制整合到检索和生成阶段。在检索阶段，DRAG采用支持者、反对者和裁判的结构化辩论机制，优化检索质量并确保事实可靠性。在生成阶段，DRAG引入非对称信息角色和对抗性辩论，增强推理鲁棒性并减少事实不一致性。多任务评估表明，DRAG能提高检索可靠性，减少RAG引发的幻觉，并显著提升整体事实准确性。我们的代码发布于https://github.com/Huenao/Debate-Augmented-RAG。

Safety Alignment via Constrained Knowledge Unlearning

Abstract

arXiv:2505.18588v1 Announce Type: cross Abstract: Despite significant progress in safety alignment, large language models (LLMs) remain susceptible to jailbreak attacks. Existing defense mechanisms have not fully deleted harmful knowledge in LLMs, which allows such attacks to bypass safeguards and produce harmful outputs. To address this challenge, we propose a novel safety alignment strategy, Constrained Knowledge Unlearning (CKU), which focuses on two primary objectives: knowledge localization and retention, and unlearning harmful knowledge. CKU works by scoring neurons in specific multilayer perceptron (MLP) layers to identify a subset U of neurons associated with useful knowledge. During the unlearning process, CKU prunes the gradients of neurons in U to preserve valuable knowledge while effectively mitigating harmful content. Experimental results demonstrate that CKU significantly enhances model safety without compromising overall performance, offering a superior balance between safety and utility compared to existing methods. Additionally, our analysis of neuron knowledge sensitivity across various MLP layers provides valuable insights into the mechanics of safety alignment and model knowledge editing.

摘要

尽管在安全对齐方面取得了显著进展，大型语言模型（LLM）仍易受越狱攻击影响。现有防御机制未能完全消除模型中的有害知识，导致攻击者可绕过防护措施生成有害输出。针对这一挑战，我们提出了一种新颖的安全对齐策略——约束性知识遗忘（CKU），该策略聚焦两大目标：知识定位与保留，以及有害知识遗忘。CKU通过为特定多层感知机（MLP）层中的神经元评分，识别出与有用知识相关的神经元子集U。在遗忘过程中，CKU对U中神经元的梯度进行剪枝，在有效消除有害内容的同时保留有价值的知识。实验结果表明，CKU能在不影响模型整体性能的前提下显著提升安全性，相比现有方法实现了安全性与实用性的更优平衡。此外，我们对不同MLP层神经元知识敏感度的分析，为安全对齐和模型知识编辑的机制提供了重要见解。
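The gradient-pruning step described in this abstract can be sketched on a toy layer: score neurons, mark the highest-scoring ones as the protected subset U, and zero their gradients before the unlearning update. The scoring values, the `keep_ratio` parameter, and the helper names are hypothetical. How CKU actually scores neurons and defines its unlearning loss is not specified here.

```python
def select_protected(scores, keep_ratio):
    """Return indices of the highest-scoring neurons (the subset U)."""
    count = max(1, int(len(scores) * keep_ratio))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:count])


def prune_gradients(grads, protected):
    """Zero the gradient rows of protected neurons so unlearning skips them."""
    return [
        [0.0] * len(row) if i in protected else list(row)
        for i, row in enumerate(grads)
    ]


def sgd_step(weights, grads, lr):
    """Plain gradient step; the unlearning loss itself is out of scope here."""
    return [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grads)
    ]


scores = [0.9, 0.1, 0.4, 0.8]           # hypothetical usefulness scores
weights = [[1.0, 1.0] for _ in scores]  # one row per MLP neuron
grads = [[0.5, -0.5] for _ in scores]   # gradients of an unlearning loss

U = select_protected(scores, keep_ratio=0.5)
updated = sgd_step(weights, prune_gradients(grads, U), lr=0.1)
```

With these numbers U is {0, 3}, so rows 0 and 3 keep their weights while rows 1 and 2 move under the unlearning gradient.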

MisoDICE: Multi-Agent Imitation from Unlabeled Mixed-Quality Demonstrations

Abstract

arXiv:2505.18595v1 Announce Type: cross Abstract: We study offline imitation learning (IL) in cooperative multi-agent settings, where demonstrations have unlabeled mixed quality - containing both expert and suboptimal trajectories. Our proposed solution is structured in two stages: trajectory labeling and multi-agent imitation learning, designed jointly to enable effective learning from heterogeneous, unlabeled data. In the first stage, we combine advances in large language models and preference-based reinforcement learning to construct a progressive labeling pipeline that distinguishes expert-quality trajectories. In the second stage, we introduce MisoDICE, a novel multi-agent IL algorithm that leverages these labels to learn robust policies while addressing the computational complexity of large joint state-action spaces. By extending the popular single-agent DICE framework to multi-agent settings with a new value decomposition and mixing architecture, our method yields a convex policy optimization objective and ensures consistency between global and local policies. We evaluate MisoDICE on multiple standard multi-agent RL benchmarks and demonstrate superior performance, especially when expert data is scarce.

摘要

我们研究合作多智能体环境下的离线模仿学习（IL），其中演示数据包含未标注的混合质量轨迹——既有专家级也有次优轨迹。提出的解决方案采用两阶段结构：轨迹标注和多智能体模仿学习，通过联合设计实现从异构未标注数据中有效学习。第一阶段结合大型语言模型和基于偏好的强化学习技术，构建渐进式标注流程以识别专家级轨迹。第二阶段提出MisoDICE算法，这是一种新型多智能体IL方法，利用标注信息学习鲁棒策略，同时解决大规模联合状态-动作空间的计算复杂度问题。通过将流行的单智能体DICE框架扩展至多智能体场景，并采用新的价值分解与混合架构，我们的方法产生了凸策略优化目标，确保全局与局部策略的一致性。在多个标准多智能体强化学习基准测试中评估MisoDICE，结果表明其性能优越，尤其在专家数据稀缺时表现突出。

Autocomp: LLM-Driven Code Optimization for Tensor Accelerators

Abstract

arXiv:2505.18574v1 Announce Type: cross Abstract: Hardware accelerators, especially those designed for tensor processing, have become ubiquitous in today's computing landscape. However, even with significant efforts in building compilers, programming these tensor accelerators remains challenging, leaving much of their potential underutilized. Recently, large language models (LLMs), trained on large amounts of code, have shown significant promise in code generation and optimization tasks, but generating low-resource languages like specialized tensor accelerator code still poses a significant challenge. We tackle this challenge with Autocomp, an approach that empowers accelerator programmers to leverage domain knowledge and hardware feedback to optimize code via an automated LLM-driven search. We accomplish this by: 1) formulating each optimization pass as a structured two-phase prompt, divided into planning and code generation phases, 2) inserting domain knowledge during planning via a concise and adaptable optimization menu, and 3) integrating correctness and performance metrics from hardware as feedback at each search iteration. Across three categories of representative workloads and two different accelerators, we demonstrate that Autocomp-optimized code runs 5.6x (GEMM) and 2.7x (convolution) faster than the vendor-provided library, and outperforms expert-level hand-tuned code by 1.4x (GEMM), 1.1x (convolution), and 1.3x (fine-grained linear algebra). Additionally, we demonstrate that optimization schedules generated from Autocomp can be reused across similar tensor operations, improving speedups by up to 24% under a fixed sample budget.

摘要

硬件加速器，尤其是专为张量处理设计的加速器，已在当今计算领域无处不在。然而，尽管在编译器构建方面投入了大量努力，对这些张量加速器进行编程仍然具有挑战性，导致其潜力远未得到充分利用。近期，基于海量代码训练的大型语言模型（LLMs）在代码生成与优化任务中展现出显著潜力，但生成专用张量加速器代码等低资源语言仍面临重大挑战。我们提出Autocomp方法应对这一挑战，该方法使加速器程序员能够利用领域知识和硬件反馈，通过自动化LLM驱动搜索优化代码。具体实现包括：1）将每个优化过程构建为结构化的两阶段提示（规划阶段与代码生成阶段），2）在规划阶段通过简洁可适配的优化菜单注入领域知识，3）在每次搜索迭代中整合来自硬件的正确性指标与性能指标作为反馈。在三大类典型工作负载和两种不同加速器上的实验表明，经Autocomp优化的代码运行速度比厂商提供的库快5.6倍（GEMM）和2.7倍（卷积），并分别以1.4倍（GEMM）、1.1倍（卷积）和1.3倍（细粒度线性代数）的优势超越专家级手工调优代码。此外，我们证明Autocomp生成的优化方案可在相似张量运算中复用，在固定样本预算下将加速效果提升最高达24%。

Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models

Abstract

arXiv:2505.18596v1 Announce Type: cross Abstract: The proliferation of misinformation in digital platforms reveals the limitations of traditional detection methods, which mostly rely on static classification and fail to capture the intricate process of real-world fact-checking. Despite advancements in Large Language Models (LLMs) that enhance automated reasoning, their application to misinformation detection remains hindered by issues of logical inconsistency and superficial verification. In response, we introduce Debate-to-Detect (D2D), a novel Multi-Agent Debate (MAD) framework that reformulates misinformation detection as a structured adversarial debate. Inspired by fact-checking workflows, D2D assigns domain-specific profiles to each agent and orchestrates a five-stage debate process, including Opening Statement, Rebuttal, Free Debate, Closing Statement, and Judgment. To transcend traditional binary classification, D2D introduces a multi-dimensional evaluation mechanism that assesses each claim across five distinct dimensions: Factuality, Source Reliability, Reasoning Quality, Clarity, and Ethics. Experiments with GPT-4o on two fake news datasets demonstrate significant improvements over baseline methods, and the case study highlights D2D's capability to iteratively refine evidence while improving decision transparency, representing a substantial advancement towards robust and interpretable misinformation detection. The code will be open-sourced in a future release.

摘要

数字平台中虚假信息的泛滥暴露了传统检测方法的局限性，这些方法主要依赖静态分类，无法捕捉现实世界事实核查的复杂过程。尽管大型语言模型（LLMs）的进步增强了自动推理能力，但其在虚假信息检测中的应用仍受困于逻辑不一致性和表面化验证等问题。为此，我们提出"辩论式检测"（Debate-to-Detect，D2D）——一种新颖的多智能体辩论框架，将虚假信息检测重构为结构化对抗辩论。受事实核查工作流程启发，D2D为每个智能体分配特定领域角色，并设计五阶段辩论流程：开场陈述、反驳、自由辩论、结辩陈述和裁决。为超越传统二元分类，D2D引入多维评估机制，从五个维度评估每项主张：事实性、信源可靠性、推理质量、清晰度和伦理合规性。基于GPT-4o在两个假新闻数据集上的实验表明，该方法较基线有显著提升，案例研究凸显D2D能迭代优化证据并提高决策透明度，标志着向稳健可解释的虚假信息检测迈出重要一步。代码将于后续版本开源。
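The five debate stages and five evaluation dimensions named in this abstract map naturally onto a small data structure. The sketch below is an assumption-heavy illustration: the `judge` helper, the two-sided scoring, and the rule of comparing mean scores are invented for the example, and the abstract does not say how D2D aggregates its dimensions.

```python
from enum import Enum
from statistics import mean


class Stage(Enum):
    OPENING_STATEMENT = 1
    REBUTTAL = 2
    FREE_DEBATE = 3
    CLOSING_STATEMENT = 4
    JUDGMENT = 5


DIMENSIONS = ("factuality", "source_reliability", "reasoning_quality", "clarity", "ethics")


def judge(scores_for, scores_against):
    """Compare per-dimension scores for the side defending and the side attacking a claim.

    Averaging the five dimensions and comparing means is an assumption of this sketch.
    """
    for scores in (scores_for, scores_against):
        missing = set(DIMENSIONS) - set(scores)
        if missing:
            raise ValueError(f"missing dimensions: {sorted(missing)}")
    total_for = mean(scores_for[d] for d in DIMENSIONS)
    total_against = mean(scores_against[d] for d in DIMENSIONS)
    return "real" if total_for > total_against else "fake"


# Hypothetical judge scores on a 1-5 scale after the closing statements.
defending = {d: 3 for d in DIMENSIONS}
attacking = {d: 4 for d in DIMENSIONS}
verdict = judge(defending, attacking)
```

Keeping the per-dimension scores alongside the verdict is what makes the decision inspectable, since a reader can see which dimension drove the outcome.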

Rethinking Causal Mask Attention for Vision-Language Inference

Abstract

arXiv:2505.18605v1 Announce Type: cross Abstract: Causal attention has become a foundational mechanism in autoregressive vision-language models (VLMs), unifying textual and visual inputs under a single generative framework. However, existing causal mask-based strategies are inherited from large language models (LLMs) where they are tailored for text-only decoding, and their adaptation to vision tokens is insufficiently addressed in the prefill stage. Strictly masking future positions for vision queries introduces overly rigid constraints, which hinder the model's ability to leverage future context that often contains essential semantic cues for accurate inference. In this work, we empirically investigate how different causal masking strategies affect vision-language inference and then propose a family of future-aware attentions tailored for this setting. We first empirically analyze the effect of previewing future tokens for vision queries and demonstrate that rigid masking undermines the model's capacity to capture useful contextual semantic representations. Based on these findings, we propose a lightweight attention family that aggregates future visual context into past representations via pooling, effectively preserving the autoregressive structure while enhancing cross-token dependencies. We evaluate a range of causal masks across diverse vision-language inference settings and show that selectively compressing future semantic context into past representations benefits the inference.

摘要

因果注意力已成为自回归视觉语言模型（VLM）的基础机制，将文本与视觉输入统一在单一生成框架下。然而，现有基于因果掩码的策略继承自纯文本解码的大语言模型（LLM），其在预填充阶段对视觉标记的适应性处理不足。对视觉查询严格屏蔽未来位置会引入过度刚性约束，阻碍模型利用常含关键语义线索的未来上下文进行准确推理。本研究通过实证探讨不同因果掩码策略如何影响视觉语言推理，进而提出适用于该场景的未来感知注意力机制家族。我们首先实证分析了视觉查询中预览未来标记的效果，证明刚性掩码会削弱模型捕获有用上下文语义表征的能力。基于这些发现，我们提出一种轻量级注意力家族，通过池化将未来视觉上下文聚合到历史表征中，在保持自回归结构的同时增强跨标记依赖性。我们在多样化视觉语言推理场景中评估了多种因果掩码，结果表明有选择地将未来语义上下文压缩至历史表征有利于提升推理性能。

LLM-Meta-SR: Learning to Evolve Selection Operators for Symbolic Regression

Abstract

arXiv:2505.18602v1 Announce Type: cross Abstract: Large language models (LLMs) have revolutionized algorithm development, yet their application in symbolic regression, where algorithms automatically discover symbolic expressions from data, remains constrained and is typically designed manually by human experts. In this paper, we propose a learning-to-evolve framework that enables LLMs to automatically design selection operators for evolutionary symbolic regression algorithms. We first identify two key limitations in existing LLM-based algorithm evolution techniques: code bloat and a lack of semantic guidance. Bloat results in unnecessarily complex components, and the absence of semantic awareness can lead to ineffective exchange of useful code components, both of which can reduce the interpretability of the designed algorithm or hinder evolutionary learning progress. To address these issues, we enhance the LLM-based evolution framework for meta symbolic regression with two key innovations: bloat control and a complementary, semantics-aware selection operator. Additionally, we embed domain knowledge into the prompt, enabling the LLM to generate more effective and contextually relevant selection operators. Our experimental results on symbolic regression benchmarks show that LLMs can devise selection operators that outperform nine expert-designed baselines, achieving state-of-the-art performance. This demonstrates that LLMs can exceed expert-level algorithm design for symbolic regression.

摘要

大语言模型（LLMs）已经彻底改变了算法开发的范式，但其在符号回归（即算法从数据中自动发现符号表达式）中的应用仍受限制，且通常由人类专家手动设计。本文提出一种"学习进化"框架，使LLMs能够自动为进化式符号回归算法设计选择算子。我们首先指出现有基于LLM的算法进化技术存在两个关键局限：代码膨胀和语义引导缺失。代码膨胀会导致生成不必要的复杂组件，而语义意识的缺乏可能阻碍有效代码组件的交换，这两者都会降低所设计算法的可解释性或阻碍进化学习进程。为解决这些问题，我们通过两项关键创新增强了基于LLM的元符号回归进化框架：膨胀控制和互补的语义感知选择算子。此外，我们将领域知识嵌入提示词中，使LLM能生成更有效且符合上下文的选择算子。在符号回归基准测试中的实验结果表明，LLMs设计的选择算子性能优于九种专家设计的基线方法，达到了最先进的水平。这证明LLMs在符号回归领域的算法设计能力可以超越专家水平。

DDO: Dual-Decision Optimization via Multi-Agent Collaboration for LLM-Based Medical Consultation

Abstract

arXiv:2505.18630v1 Announce Type: cross Abstract: Large Language Models (LLMs) demonstrate strong generalization and reasoning abilities, making them well-suited for complex decision-making tasks such as medical consultation (MC). However, existing LLM-based methods often fail to capture the dual nature of MC, which entails two distinct sub-tasks: symptom inquiry, a sequential decision-making process, and disease diagnosis, a classification problem. This mismatch often results in ineffective symptom inquiry and unreliable disease diagnosis. To address this, we propose DDO, a novel LLM-based framework that performs Dual-Decision Optimization by decoupling and independently optimizing the two sub-tasks through a collaborative multi-agent workflow. Experiments on three real-world MC datasets show that DDO consistently outperforms existing LLM-based approaches and achieves competitive performance with state-of-the-art generation-based methods, demonstrating its effectiveness in the MC task.

摘要

大语言模型（LLMs）展现出强大的泛化与推理能力，使其特别适合医疗咨询（MC）等复杂决策任务。然而，现有基于LLM的方法往往未能捕捉MC的双重特性——该任务包含两个截然不同的子任务：作为序列决策过程的症状问询，以及作为分类问题的疾病诊断。这种失配常导致症状问询低效与疾病诊断不可靠。为此，我们提出DDO框架，通过协作式多智能体工作流对两个子任务进行解耦与独立优化，实现双决策优化。在三个真实世界MC数据集上的实验表明，DDO始终优于现有基于LLM的方法，并与最先进的生成式方法达到相当性能，验证了其在MC任务中的有效性。

Flex-Judge: Think Once, Judge Anywhere

Abstract

arXiv:2505.18601v1 Announce Type: cross Abstract: Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluations. While large language models (LLMs) employed as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduce the costs associated with manual annotations, they typically require extensive modality-specific training data and fail to generalize well across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling an effective transfer to multimodal judgments, e.g., with images or videos. Empirical results demonstrate that Flex-Judge, despite being trained on significantly less text data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, Flex-Judge shows broad impact in modalities such as molecules, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge.

摘要

人类生成的奖励信号对于将生成模型与人类偏好对齐、指导训练及推理时评估至关重要。虽然采用大语言模型（LLMs）作为代理评估器（即LLM-as-a-Judge）能显著降低人工标注成本，但这些模型通常需要大量特定模态的训练数据，且难以在多样化多模态任务中实现良好泛化。本文提出Flex-Judge——一种基于推理引导的多模态评判模型，该模型利用极少量文本推理数据即可在多种模态和评估格式间实现稳健泛化。我们的核心观点是：结构化文本推理解释本身编码了可泛化的决策模式，从而能有效迁移至图像或视频等多模态评判任务。实验结果表明，Flex-Judge尽管仅使用显著更少的文本数据进行训练，其性能仍可与最先进的商业API及经过大量训练的多模态评估器相媲美甚至更优。值得注意的是，Flex-Judge在分子等缺乏全面评估基准的模态中展现出广泛影响力，凸显了其在资源受限领域的实用价值。该框架表明，基于推理的文本监督可作为传统高成本标注方法的强效替代方案，为可扩展的多模态模型即评判者（model-as-a-judge）提供了重要推进。

SEW: Self-Evolving Agentic Workflows for Automated Code Generation

Abstract

arXiv:2505.18646v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated effectiveness in code generation tasks. To enable LLMs to address more complex coding challenges, existing research has focused on crafting multi-agent systems with agentic workflows, where complex coding tasks are decomposed into sub-tasks and assigned to specialized agents. Despite their effectiveness, current approaches heavily rely on hand-crafted agentic workflows, with both agent topologies and prompts manually designed, which limits their ability to automatically adapt to different types of coding problems. To address these limitations and enable automated workflow design, we propose Self-Evolving Workflow (SEW), a novel self-evolving framework that automatically generates and optimises multi-agent workflows. Extensive experiments on three coding benchmark datasets, including the challenging LiveCodeBench, demonstrate that our SEW can automatically design agentic workflows and optimise them through self-evolution, bringing up to 33% improvement on LiveCodeBench compared to using the backbone LLM only. Furthermore, by investigating different representation schemes of workflow, we provide insights into the optimal way to encode workflow information with text.

摘要

大语言模型（LLMs）在代码生成任务中已展现出显著成效。为使LLMs能够应对更复杂的编程挑战，现有研究致力于构建具有代理工作流的多智能体系统，将复杂编码任务分解为子任务并分配给专业化代理。尽管这些方法有效，当前方案仍严重依赖手工设计的代理工作流，其智能体拓扑结构和提示词均为人工设定，这限制了其自动适应不同类型编码问题的能力。为解决这些局限并实现工作流自动设计，我们提出自进化工作流（SEW），这是一种能自动生成并优化多智能体工作流的新型自进化框架。在三个代码基准数据集（包括高难度的LiveCodeBench）上的大量实验表明，我们的SEW能通过自主进化设计并优化代理工作流，相比仅使用骨干LLM，在LiveCodeBench上最高可带来33%的性能提升。此外，通过探究工作流的不同表示方案，我们为文本编码工作流信息的最优方式提供了理论依据。

Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics

Abstract

arXiv:2505.18658v1 Announce Type: cross Abstract: Large Language Models (LLMs) have emerged as a promising cornerstone for the development of natural language processing (NLP) and artificial intelligence (AI). However, ensuring the robustness of LLMs remains a critical challenge. To address these challenges and advance the field, this survey provides a comprehensive overview of current studies in this area. First, we systematically examine the nature of robustness in LLMs, including its conceptual foundations, the importance of consistent performance across diverse inputs, and the implications of failure modes in real-world applications. Next, we analyze the sources of non-robustness, categorizing intrinsic model limitations, data-driven vulnerabilities, and external adversarial factors that compromise reliability. Following this, we review state-of-the-art mitigation strategies, and then we discuss widely adopted benchmarks, emerging metrics, and persistent gaps in assessing real-world reliability. Finally, we synthesize findings from existing surveys and interdisciplinary studies to highlight trends, unresolved issues, and pathways for future research.

摘要

大型语言模型（LLMs）已成为推动自然语言处理（NLP）和人工智能（AI）发展的关键基石。然而，确保其鲁棒性仍是重要挑战。为应对这些问题并推动领域进展，本综述对该领域现有研究进行了全面梳理。首先，我们系统性地探讨了LLMs鲁棒性的本质，包括其概念基础、多样化输入下保持性能一致的重要性，以及实际应用中失效模式的影响。其次，我们分析了非鲁棒性的来源，将其归类为内在模型局限、数据驱动的脆弱性，以及影响可靠性的外部对抗因素。随后，我们综述了前沿的缓解策略，进而讨论了广泛采用的基准测试、新兴评估指标及现实场景可靠性评估中存在的持续缺陷。最后，通过整合现有综述与跨学科研究成果，我们揭示了当前趋势、待解难题以及未来研究的潜在路径。

Large Language Models in the Task of Automatic Validation of Text Classifier Predictions

Abstract

arXiv:2505.18688v1 Announce Type: cross Abstract: Machine learning models for text classification are trained to predict a class for a given text. To do this, training and validation samples must be prepared: a set of texts is collected, and each text is assigned a class. These classes are usually assigned by human annotators with different expertise levels, depending on the specific classification task. Collecting such samples from scratch is labor-intensive because it requires finding specialists and compensating them for their work; moreover, the number of available specialists is limited, and their productivity is constrained by human factors. While it may not be too resource-intensive to collect samples once, the ongoing need to retrain models (especially in incremental learning pipelines) to address data drift (also called model drift) makes the data collection process crucial and costly over the model's entire lifecycle. This paper proposes several approaches to replace human annotators with Large Language Models (LLMs) to test classifier predictions for correctness, helping ensure model quality and support high-quality incremental learning.

摘要

文本分类的机器学习模型通过训练预测给定文本的类别。为此需要准备训练和验证样本：收集文本集合并为每篇文本标注类别。这些类别通常由不同专业水平的人工标注者根据具体分类任务进行标注。从头开始收集此类样本需要耗费大量人力，因为需寻找专家并支付报酬；此外可用专家数量有限，且其生产力受人因因素制约。虽然单次样本收集可能资源消耗不大，但为解决数据漂移（亦称模型漂移）而持续进行的模型重训练（特别是在增量学习流程中），使得数据收集过程在模型整个生命周期中至关重要且成本高昂。本文提出用大语言模型替代人工标注者的若干方法，以测试分类器预测的正确性，从而保障模型质量并支持高质量的增量学习。

ThanoRA: Task Heterogeneity-Aware Multi-Task Low-Rank Adaptation

Abstract

arXiv:2505.18640v1 Announce Type: cross Abstract: Low-Rank Adaptation (LoRA) is widely adopted for downstream fine-tuning of foundation models due to its efficiency and zero additional inference cost. Many real-world applications require foundation models to specialize in multiple tasks simultaneously, motivating the need for efficient multi-task adaptation. While recent approaches integrate LoRA with mixture-of-experts (MoE) to address this, the use of routers prevents parameter mergeability, which increases inference overhead and hinders unified multi-task adaptation, thereby limiting deployment practicality. In this work, we propose ThanoRA, a Task Heterogeneity-Aware Multi-Task Low-Rank Adaptation framework that enables multi-task adaptation while preserving the inference efficiency of LoRA. ThanoRA jointly models task heterogeneity and mitigates subspace interference throughout training. Specifically, motivated by inherent differences in complexity and heterogeneity across tasks, ThanoRA constructs task-specific LoRA subspaces at initialization, enabling fine-grained knowledge injection aligned with task heterogeneity. Furthermore, to prevent task interference and subspace collapse during multi-task training, ThanoRA introduces a subspace-preserving regularization that maintains the independence of task-specific representations. With the synergy of both components, ThanoRA enables efficient and unified multi-task adaptation. Extensive experiments across multimodal and text-only benchmarks under varying multi-task mixtures demonstrate that ThanoRA consistently achieves robust and superior performance over strong baselines without introducing additional inference overhead. Our code is publicly available at: https://github.com/LiangJian24/ThanoRA.

摘要

低秩自适应（LoRA）因其高效性和零额外推理成本的优势，被广泛用于基础模型的下游微调。许多实际应用要求基础模型能同时适应多任务处理，这推动了对高效多任务自适应方法的需求。尽管近期研究尝试将LoRA与专家混合（MoE）相结合来解决这一问题，但路由器的使用导致参数无法合并，从而增加推理开销并阻碍统一的多任务自适应，限制了实际部署的可行性。本研究提出ThanoRA框架——一种任务异构感知的多任务低秩自适应方法，在保持LoRA推理效率的同时实现多任务自适应。ThanoRA通过联合建模任务异构性并在整个训练过程中减轻子空间干扰来实现这一目标。具体而言，基于任务间固有复杂度与异构性的差异，ThanoRA在初始化阶段构建任务特定的LoRA子空间，实现与任务异构性对齐的细粒度知识注入。此外，为防止多任务训练中的任务干扰和子空间坍缩，ThanoRA引入子空间保持正则化机制以维持任务特定表征的独立性。通过双组件的协同作用，ThanoRA实现了高效统一的多任务自适应。在多模态及纯文本基准测试上的大量实验表明，在不同多任务混合场景下，ThanoRA始终以稳健且优越的性能超越强基线方法，且未引入额外推理开销。代码已开源：https://github.com/LiangJian24/ThanoRA。
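
The "zero additional inference cost" property of LoRA mentioned above follows from the fact that a low-rank update can be folded into the base weight once training is done. The sketch below illustrates only this generic merge identity with hand-picked integer matrices; the helper names are hypothetical, and nothing here reproduces ThanoRA's task-specific subspaces or its regularization.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matadd(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def matvec(m, x):
    """Multiply a matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]


# Frozen base weight W (2x2) and a rank-1 adapter B (2x1) times A (1x2).
W = [[1, 2], [3, 4]]
B = [[1], [2]]
A = [[3, -1]]
x = [1, 1]

# Unmerged: two extra small matmuls at inference time.
unmerged = [w + d for w, d in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Merged: fold B @ A into W once, then run a single matmul.
W_merged = matadd(W, matmul(B, A))
merged = matvec(W_merged, x)
```

Both paths give the same output, which is why a merged adapter adds no inference overhead. A router that selects adapters per input makes the effective weight input-dependent, so this one-time fold is no longer possible, which is the mergeability problem the abstract attributes to LoRA-MoE hybrids.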

Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees

Abstract

arXiv:2505.18659v1 Announce Type: cross Abstract: Selecting artificial intelligence (AI) models, such as large language models (LLMs), from multiple candidates requires accurate performance estimation. This is ideally achieved through empirical evaluations involving abundant real-world data. However, such evaluations are costly and impractical at scale. To address this challenge, autoevaluation methods leverage synthetic data produced by automated evaluators, such as LLMs-as-judges, reducing variance but potentially introducing bias. Recent approaches have employed semi-supervised prediction-powered inference (PPI) to correct for the bias of autoevaluators. However, the use of autoevaluators may lead in practice to a degradation in sample efficiency compared to conventional methods using only real-world data. In this paper, we propose R-AutoEval+, a novel framework that provides finite-sample reliability guarantees on the model evaluation, while also ensuring an enhanced (or at least no worse) sample efficiency compared to conventional methods. The key innovation of R-AutoEval+ is an adaptive construction of the model evaluation variable, which dynamically tunes its reliance on synthetic data, reverting to conventional methods when the autoevaluator is insufficiently accurate. Experiments on the use of LLMs-as-judges for the optimization of quantization settings for the weights of an LLM, and for prompt design in LLMs confirm the reliability and efficiency of R-AutoEval+.

摘要

从多个候选模型（如大语言模型LLMs）中选择人工智能（AI）模型需要准确的性能评估。理想情况下，这应通过涉及大量真实世界数据的实证评估来实现。然而，此类评估成本高昂且难以大规模实施。为解决这一挑战，自动评估方法利用自动化评估器（如LLMs-as-judges）生成的合成数据来降低方差，但可能引入偏差。近期研究采用半监督预测驱动推断（PPI）来校正自动评估器的偏差。然而在实际应用中，与仅使用真实数据的传统方法相比，自动评估器可能导致样本效率下降。本文提出R-AutoEval+框架，该框架在保证模型评估具有有限样本可靠性的同时，相较于传统方法能提升（或至少不降低）样本效率。R-AutoEval+的核心创新在于自适应构建模型评估变量，动态调整对合成数据的依赖程度，当自动评估器精度不足时自动回归传统方法。在LLM权重量化设置优化和LLM提示设计场景中使用LLMs-as-judges的实验证实了R-AutoEval+的可靠性与高效性。
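
The bias correction that prediction-powered inference applies to an autoevaluator can be illustrated with the basic PPI mean estimator: average the judge's scores on the large unlabeled pool, then add the average judge error measured on a small human-labeled set. The sketch below assumes that simple form with made-up scores; the function names are hypothetical, and it does not implement R-AutoEval+ or its adaptive weighting and finite-sample guarantees.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def ppi_mean(labeled_true, labeled_pred, unlabeled_pred):
    """Basic prediction-powered estimate of a mean score.

    labeled_true:   human scores on the small labeled set
    labeled_pred:   judge scores on the same labeled examples
    unlabeled_pred: judge scores on the large unlabeled pool
    """
    # The rectifier estimates (and removes) the judge's average bias.
    rectifier = mean([y - f for y, f in zip(labeled_true, labeled_pred)])
    return mean(unlabeled_pred) + rectifier


# Illustrative data: the judge over-scores every example by 0.5.
labeled_true = [1.0, 0.0, 1.0, 0.0]
labeled_pred = [1.5, 0.5, 1.5, 0.5]
unlabeled_pred = [1.5, 0.5, 0.5, 1.5, 1.5, 0.5, 1.5, 1.5]

naive = mean(unlabeled_pred)
corrected = ppi_mean(labeled_true, labeled_pred, unlabeled_pred)
```

Here the naive judge-only average is inflated by the judge's constant offset, and the rectifier removes exactly that offset. When the judge is noisy rather than consistently biased, the rectifier itself becomes noisy, which is the sample-efficiency concern the abstract raises.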

Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps

Abstract

arXiv:2505.18675v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) have recently achieved significant progress in visual tasks, including semantic scene understanding and text-image alignment, with reasoning variants enhancing performance on complex tasks involving mathematics and logic. However, their capacity for reasoning tasks involving fine-grained visual understanding remains insufficiently evaluated. To address this gap, we introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Furthermore, we design a two-level evaluation pipeline that properly assesses answer correctness and quality. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern: among open-source models, base models outperform reasoning ones, while the opposite trend is observed in closed-source models. Additionally, performance generally degrades when visual inputs are masked, indicating that while MLLMs can leverage prior knowledge to answer some questions, fine-grained visual reasoning tasks still require genuine visual perception for strong performance. Our benchmark study offers new insights into visual reasoning and contributes to investigating the gap between open-source and closed-source models.

摘要

多模态大语言模型（MLLMs）近期在视觉任务中取得显著进展，涵盖语义场景理解和图文对齐等领域，其推理变体更在涉及数学与逻辑的复杂任务上表现出性能提升。然而，这些模型在需要细粒度视觉理解的推理任务中的能力尚未得到充分评估。为此，我们提出ReasonMap基准测试，旨在系统评估MLLMs的细粒度视觉理解与空间推理能力。该基准包含来自13个国家30个城市的高清交通路线图，共计1,008个涵盖两种问题类型和三种模板的问答对。我们进一步设计了两级评估流程，以准确评判答案的正确性与质量。通过对15个主流MLLMs（包括基础版与推理变体）的全面测试，发现一个反直觉现象：开源模型中基础版性能优于推理版，而闭源模型则呈现相反趋势。此外，当视觉输入被遮蔽时模型性能普遍下降，这表明尽管MLLMs能利用先验知识回答部分问题，但优秀的细粒度视觉推理仍需依赖真实的视觉感知。本研究为视觉推理领域提供了新见解，并为探索开源与闭源模型间的性能差距贡献了研究基础。

Steering LLM Reasoning Through Bias-Only Adaptation

Abstract

arXiv:2505.18706v1 Announce Type: cross Abstract: Recent work on reasoning-oriented language models, exemplified by o1-like systems, suggests that reinforcement-learning (RL) finetuning does not create new capabilities but instead strengthens reasoning patterns already latent in the pretrained network. We test this claim by training steering vectors: layer-wise biases that additively amplify selected hidden features while leaving all original weights unchanged. Experiments on four base models across the GSM8K and MATH benchmarks show that steering vectors recover, and in several cases exceed, the accuracy of fully-tuned counterparts. This result supports the view that the required reasoning skills pre-exist in the base model. Further, logit-lens analysis reveals that the trained vectors consistently boost token groups linked to structured languages and logical connectors, providing an interpretable account that aligns with the demands of quantitative reasoning tasks.

摘要

近期关于推理导向语言模型的研究（以o1类系统为例）表明，强化学习（RL）微调并不会创造新能力，而是强化了预训练网络中已有的潜在推理模式。我们通过训练导向向量（即逐层偏置项，以加法方式放大选定隐藏特征同时保持原始权重不变）来验证这一主张。在GSM8K和MATH基准测试中对四个基础模型进行的实验显示，导向向量恢复并在多个案例中超越了完全微调模型的准确率。这一结果支持了"所需推理技能已存在于基础模型中"的观点。此外，logit透镜分析表明，训练后的向量持续增强了与结构化语言和逻辑连接词相关的标记组，为定量推理任务的需求提供了可解释的依据。
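
A steering vector in the sense used above is just a per-layer bias added to the hidden state while every original weight stays frozen. The toy forward pass below is a minimal sketch of that idea; the two-layer ReLU network, the point where the bias is added, and the hand-picked bias values are all assumptions for illustration, whereas the paper trains its vectors on real language models.

```python
def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def forward(layers, x, steering=None):
    """Run a frozen stack of layers, optionally adding one bias per layer.

    layers:   list of weight matrices; never modified here
    steering: optional list of bias vectors, one per layer, added to the
              hidden state after that layer (hypothetical placement)
    """
    h = x
    for idx, weights in enumerate(layers):
        h = relu(linear(weights, h))
        if steering is not None:
            h = [a + b for a, b in zip(h, steering[idx])]
    return h


layers = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, -1.0]],
]
x = [1.0, 2.0]

baseline = forward(layers, x)
steered = forward(layers, x, steering=[[0.0, -2.0], [0.0, 0.0]])
```

The only trainable parameters in this setup would be the steering vectors, so any change in output is attributable to shifting existing hidden features rather than to new weights.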

Can LLMs Alleviate Catastrophic Forgetting in Graph Continual Learning? A Systematic Study

Abstract

arXiv:2505.18697v1 Announce Type: cross Abstract: Nowadays, real-world data, including graph-structured data, often arrives in a streaming manner, which means that learning systems need to continuously acquire new knowledge without forgetting previously learned information. Although a substantial body of existing work attempts to address catastrophic forgetting in graph machine learning, these methods are all based on training from scratch with streaming data. With the rise of pretrained models, an increasing number of studies have leveraged their strong generalization ability for continual learning. Therefore, in this work, we attempt to answer whether large language models (LLMs) can mitigate catastrophic forgetting in Graph Continual Learning (GCL). We first point out that current experimental setups for GCL have significant flaws, as the evaluation stage may lead to task ID leakage. Then, we evaluate the performance of LLMs in more realistic scenarios and find that even minor modifications can lead to outstanding results. Finally, based on extensive experiments, we propose a simple-yet-effective method, Simple Graph Continual Learning (SimGCL), that surpasses the previous state-of-the-art GNN-based baseline by around 20% under the rehearsal-free constraint. To facilitate reproducibility, we have developed an easy-to-use benchmark LLM4GCL for training and evaluating existing GCL methods. The code is available at: https://github.com/ZhixunLEE/LLM4GCL.

摘要

当今世界，包括图结构数据在内的现实数据往往以流式方式到达，这意味着学习系统需要在不遗忘已掌握知识的前提下持续获取新信息。尽管现有大量研究致力于解决图机器学习中的灾难性遗忘问题，但这些方法均基于流式数据从头训练的范式。随着预训练模型的兴起，越来越多的研究利用其强大的泛化能力进行持续学习。为此，本研究旨在探究大型语言模型（LLMs）能否缓解图持续学习（GCL）中的灾难性遗忘问题。我们首先指出当前GCL实验设置存在重大缺陷，其评估阶段可能导致任务ID泄露。随后在更贴近现实的场景下评估LLMs性能，发现即使进行细微调整也能获得卓越效果。最终通过大量实验提出一种简单而有效的方法——简单图持续学习（SimGCL），在无排练约束条件下以约20%的优势超越此前最先进的基于图神经网络的基线方法。为促进可复现性研究，我们开发了易于使用的基准框架LLM4GCL，用于训练和评估现有GCL方法。代码已开源：https://github.com/ZhixunLEE/LLM4GCL。

GainRAG: Preference Alignment in Retrieval-Augmented Generation through Gain Signal Synthesis

Abstract

arXiv:2505.18710v1 Announce Type: cross Abstract: The Retrieval-Augmented Generation (RAG) framework introduces a retrieval module to dynamically inject retrieved information into the input context of large language models (LLMs), and has demonstrated significant success in various NLP tasks. However, the current study points out that there is a preference gap between retrievers and LLMs in the RAG framework, which limits the further improvement of system performance. Some highly relevant passages may interfere with LLM reasoning because they contain complex or contradictory information, while some indirectly related or even inaccurate content may help the LLM generate more accurate answers by providing suggestive information or logical clues. To solve this, we propose GainRAG, a novel approach that aligns the retriever's and LLM's preferences by defining a new metric, "gain", which measures how well an input passage contributes to correct outputs. Specifically, we propose a method to estimate these gain signals and train a middleware that aligns the preferences of the retriever and the LLM using only limited data. In addition, we introduce a pseudo-passage strategy to mitigate degradation. The experimental results on 6 datasets verify the effectiveness of GainRAG.

摘要

检索增强生成（RAG）框架通过引入检索模块，将检索到的信息动态注入大型语言模型（LLM）的输入上下文，已在多种自然语言处理任务中展现出显著成效。然而，当前研究指出RAG框架中检索器与LLM之间存在偏好差异，这限制了系统性能的进一步提升。某些高相关性段落可能因包含复杂或矛盾信息而干扰LLM推理；而部分间接相关甚至不准确的内容，却可能通过提供提示性信息或逻辑线索帮助LLM生成更准确的答案。为此，我们提出GainRAG方法，通过定义新指标"增益"来衡量输入段落对正确输出的贡献程度，从而实现检索器与LLM的偏好对齐。具体而言，我们提出一种增益信号估计方法，并训练仅需有限数据即可实现两者偏好对齐的中间件。此外，我们引入伪段落策略以缓解性能退化问题。在6个数据集上的实验结果验证了GainRAG的有效性。

How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark

Abstract

arXiv:2505.18761v1 Announce Type: cross Abstract: We introduce Grade School Math with Distracting Context (GSM-DC), a synthetic benchmark to evaluate Large Language Models' (LLMs) reasoning robustness against systematically controlled irrelevant context (IC). GSM-DC constructs symbolic reasoning graphs with precise distractor injections, enabling rigorous, reproducible evaluation. Our experiments demonstrate that LLMs are significantly sensitive to IC, affecting both reasoning path selection and arithmetic accuracy. Additionally, training models with strong distractors improves performance in both in-distribution and out-of-distribution scenarios. We further propose a stepwise tree search guided by a process reward model, which notably enhances robustness in out-of-distribution conditions.

摘要

我们提出了'含干扰情境的小学数学题'(GSM-DC)这一合成基准，用于评估大语言模型(LLM)在系统控制无关情境(IC)下的推理鲁棒性。GSM-DC通过构建符号化推理图并注入精确设计的干扰项，实现了严格且可复现的评估。实验表明，LLM对无关情境表现出显著敏感性，这种干扰既影响推理路径选择也降低算术准确性。此外，采用强干扰项进行模型训练可提升其在分布内和分布外场景的表现。我们进一步提出了一种基于过程奖励模型的逐步树搜索方法，该方法显著增强了模型在分布外条件下的鲁棒性。

Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization

Abstract

arXiv:2505.18720v1 Announce Type: cross Abstract: Direct Preference Optimization (DPO) has emerged as a promising framework for aligning Large Language Models (LLMs) with human preferences by directly optimizing the log-likelihood difference between chosen and rejected responses. However, existing methods assign equal importance to all tokens in the response, while humans focus on more meaningful parts. This leads to suboptimal preference optimization, as irrelevant or noisy tokens disproportionately influence DPO loss. To address this limitation, we propose an Optimal Transport-based token weighting scheme for enhancing direct Preference Optimization (OTPO). By emphasizing semantically meaningful token pairs and de-emphasizing less relevant ones, our method introduces a context-aware token weighting scheme that yields a more contrastive reward difference estimate. This adaptive weighting enhances reward stability, improves interpretability, and ensures that preference optimization focuses on meaningful differences between responses. Extensive experiments have validated OTPO's effectiveness in improving instruction-following ability across various settings. Code is available at https://github.com/Mimasss2/OTPO.

摘要

直接偏好优化（DPO）作为一种有前景的框架，通过直接优化选定响应与拒绝响应的对数似然差，实现了大型语言模型（LLM）与人类偏好的对齐。然而现有方法均等对待响应中的所有词元，而人类更关注具有实际意义的部分。这导致偏好优化效果欠佳，因为无关或噪声词元会对DPO损失产生不成比例的影响。为解决这一局限，我们提出基于最优传输的词元加权方案来增强直接偏好优化（OTPO）。通过强化语义重要词元对的权重并弱化相关性较低的部分，本方法引入了一种上下文感知的词元加权机制，从而产生更具对比性的奖励差异估计。这种自适应加权方式增强了奖励稳定性，提高了可解释性，并确保偏好优化聚焦于响应间有意义的差异。大量实验验证了OTPO在不同场景下提升指令跟随能力的有效性（代码详见https://github.com/Mimasss2/OTPO）。
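
To make the role of token weights concrete, the sketch below computes the standard DPO loss from per-token log-probabilities and lets each token's log-ratio be scaled by a weight. With all weights equal to 1 it reduces to ordinary DPO. The weights here are hand-picked for illustration; deriving them from optimal transport, as OTPO does, is not shown, and the function names and numbers are hypothetical.

```python
import math


def weighted_logratio(policy_logps, ref_logps, weights):
    """Weighted sum over tokens of log pi(token) - log pi_ref(token)."""
    return sum(w * (p - r) for p, r, w in zip(policy_logps, ref_logps, weights))


def dpo_loss(chosen, rejected, beta=0.1):
    """DPO loss for one preference pair.

    chosen / rejected are (policy_logps, ref_logps, weights) triples.
    Uniform weights of 1.0 recover the standard sequence-level DPO loss.
    """
    margin = weighted_logratio(*chosen) - weighted_logratio(*rejected)
    # -log(sigmoid(beta * margin))
    return math.log(1.0 + math.exp(-beta * margin))


chosen_policy, chosen_ref = [-1.0, -2.0], [-1.5, -2.0]
rejected_policy, rejected_ref = [-1.0, -3.0], [-1.0, -2.5]

uniform = dpo_loss(
    (chosen_policy, chosen_ref, [1.0, 1.0]),
    (rejected_policy, rejected_ref, [1.0, 1.0]),
)

# Put all weight on the tokens where policy and reference actually differ.
focused = dpo_loss(
    (chosen_policy, chosen_ref, [2.0, 0.0]),
    (rejected_policy, rejected_ref, [0.0, 2.0]),
)
```

Concentrating weight on the tokens that carry the preference signal widens the reward margin for the same pair, which is the "more contrastive reward difference estimate" the abstract refers to.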

VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

Abstract

arXiv:2505.18719v1 Announce Type: cross Abstract: Recent high-capacity vision-language-action (VLA) models have demonstrated impressive performance on a range of robotic manipulation tasks by imitating human demonstrations. However, exploiting offline data with limited visited states will cause execution failure in out-of-distribution scenarios. Intuitively, an exploration-based method that improves on online collected data at test time could address this limitation. We present VLA-RL, an algorithmic and systematic framework that leverages online reinforcement learning (RL) to improve pretrained auto-regressive VLAs in downstream tasks. Within a unified perspective, we first introduce a trajectory-level RL formulation for auto-regressive VLA training, which models a general robotic manipulation trajectory as a multi-modal multi-turn conversation. To address the challenge of sparse rewards, we fine-tune a pretrained vision-language model as a robotic process reward model, which is trained on pseudo reward labels annotated on automatically extracted task segments. To scale up, we identify several implementation findings that improve stability and efficiency, including a curriculum selection strategy, GPU-balanced vectorized environments, batch decoding, and critic warmup. VLA-RL enables OpenVLA-7B to surpass the strongest finetuned baseline by 4.5% on 40 challenging robotic manipulation tasks in LIBERO, and even matches the performance of advanced commercial models such as $\pi_0$-FAST. Notably, we observe that VLA-RL benefits from increased test-time optimization, indicating an early spark of inference scaling laws in robotics.

摘要

近期的高容量视觉-语言-动作（VLA）模型通过模仿人类示范，在一系列机器人操作任务中展现出卓越性能。然而，利用状态覆盖有限的离线数据会导致分布外场景下的执行失败。直观上，一种基于探索的方法能够在测试时优化在线收集的数据，从而解决这一局限。我们提出VLA-RL算法与系统框架，该框架利用在线强化学习（RL）提升预训练自回归VLA模型在下游任务中的表现。在统一视角下，我们首先提出面向自回归VLA训练的轨迹级RL建模方法，将通用机器人操作轨迹视为多模态多轮对话。针对稀疏奖励的挑战，我们微调预训练视觉-语言模型作为机器人流程奖励模型，其训练数据基于自动提取的任务片段生成的伪奖励标注。为实现规模化，我们提出了提升稳定性与效率的关键实现技术，包括课程选择策略、GPU负载均衡的向量化环境、批量解码以及评论家网络预热。VLA-RL使OpenVLA-7B模型在LIBERO基准的40项复杂机器人操作任务中超越最强微调基线4.5%，甚至媲美 $\pi_0$ -FAST等先进商业模型性能。值得注意的是，我们发现VLA-RL能从测试时优化中持续获益，这预示着机器人领域推理缩放定律的早期萌芽。

LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning

Abstract

arXiv:2505.18724v1 Announce Type: cross Abstract: Quantization and fine-tuning are crucial for deploying large language models (LLMs) on resource-constrained edge devices. However, fine-tuning quantized models presents significant challenges, primarily stemming from: First, the mismatch in data types between the low-precision quantized weights (e.g., 4-bit) and the high-precision adaptation weights (e.g., 16-bit). This mismatch limits the computational efficiency advantage offered by quantized weights during inference. Second, potential accuracy degradation when merging these high-precision adaptation weights into the low-precision quantized weights, as the adaptation weights often necessitate approximation or truncation. Third, as far as we know, no existing methods support the lossless merging of adaptation while adjusting all quantized weights. To address these challenges, we introduce lossless ternary adaptation for quantization-aware fine-tuning (LoTA-QAF). This is a novel fine-tuning method specifically designed for quantized LLMs, enabling the lossless merging of ternary adaptation weights into quantized weights and the adjustment of all quantized weights. LoTA-QAF operates through a combination of: i) A custom-designed ternary adaptation (TA) that aligns ternary weights with the quantization grid and uses these ternary weights to adjust quantized weights. ii) A TA-based mechanism that enables the lossless merging of adaptation weights. iii) Ternary signed gradient descent (t-SignSGD) for updating the TA weights. We apply LoTA-QAF to Llama-3.1/3.3 and Qwen-2.5 model families and validate its effectiveness on several downstream tasks. On the MMLU benchmark, our method effectively recovers performance for quantized models, surpassing 16-bit LoRA by up to 5.14%. For task-specific fine-tuning, 16-bit LoRA achieves superior results, but LoTA-QAF still outperforms other methods.

摘要

量化与微调对于在资源受限的边缘设备上部署大语言模型(LLMs)至关重要。然而，量化模型的微调面临重大挑战，主要源于：首先，低精度量化权重(如4位)与高精度适配权重(如16位)之间的数据类型不匹配，这限制了量化权重在推理时提供的计算效率优势；其次，将这些高精度适配权重合并到低精度量化权重时可能导致精度下降，因为适配权重往往需要近似或截断处理；第三，据我们所知，现有方法均不支持在调整所有量化权重的同时实现适配权重的无损合并。为解决这些挑战，我们提出面向量化感知微调的无损三元适配方法(LoTA-QAF)。这是一种专为量化LLMs设计的新型微调方法，能够将三元适配权重无损合并到量化权重中并调整所有量化权重。LoTA-QAF通过以下组合实现：i) 定制设计的三元适配(TA)，使三元权重与量化网格对齐，并利用这些三元权重调整量化权重；ii) 基于TA的机制实现适配权重的无损合并；iii) 用于更新TA权重的三元符号梯度下降(t-SignSGD)。我们将LoTA-QAF应用于Llama-3.1/3.3和Qwen-2.5模型系列，并在多个下游任务上验证其有效性。在MMLU基准测试中，我们的方法有效恢复了量化模型的性能，较16位LoRA最高提升5.14%。在任务特定微调方面，16位LoRA虽取得更优结果，但LoTA-QAF仍优于其他方法。

Strong Membership Inference Attacks on Massive Datasets and (Moderately) Large Language Models

Abstract

arXiv:2505.18773v1 Announce Type: cross Abstract: State-of-the-art membership inference attacks (MIAs) typically require training many reference models, making it difficult to scale these attacks to large pre-trained language models (LLMs). As a result, prior research has either relied on weaker attacks that avoid training reference models (e.g., fine-tuning attacks), or on stronger attacks applied to small-scale models and datasets. However, weaker attacks have been shown to be brittle - achieving close-to-arbitrary success - and insights from strong attacks in simplified settings do not translate to today's LLMs. These challenges have prompted an important question: are the limitations observed in prior work due to attack design choices, or are MIAs fundamentally ineffective on LLMs? We address this question by scaling LiRA - one of the strongest MIAs - to GPT-2 architectures ranging from 10M to 1B parameters, training reference models on over 20B tokens from the C4 dataset. Our results advance the understanding of MIAs on LLMs in three key ways: (1) strong MIAs can succeed on pre-trained LLMs; (2) their effectiveness, however, remains limited (e.g., AUC<0.7) in practical settings; and, (3) the relationship between MIA success and related privacy metrics is not as straightforward as prior work has suggested.

摘要

现有最先进的成员推断攻击（MIA）通常需要训练大量参考模型，这使得此类攻击难以扩展到大型预训练语言模型（LLM）。因此，先前研究要么依赖无需训练参考模型的较弱攻击（如微调攻击），要么将强力攻击应用于小规模模型和数据集。然而，较弱攻击已被证明具有脆弱性——其成功率接近随机水平——且在简化场景中强力攻击的洞察无法迁移至当今的LLM。这些挑战引出了一个关键问题：先前工作中观察到的局限性是源于攻击设计选择，还是MIA本质上对LLM无效？我们通过将最强MIA之一的LiRA扩展至参数规模从1000万到10亿的GPT-2架构（在C4数据集上训练超过200亿token的参考模型）来解答该问题。我们的研究从三个关键方面推进了对LLM上MIA的理解：（1）强力MIA可成功作用于预训练LLM；（2）但在实际场景中其有效性仍有限（如AUC<0.7）；（3）MIA成功率与相关隐私指标的关系并非如先前研究暗示的那般直接。

ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models

Abstract

arXiv:2505.18799v1 Announce Type: cross Abstract: Aligning general-purpose large language models (LLMs) to downstream tasks often incurs significant costs, including constructing task-specific instruction pairs and extensive training adjustments. Prior research has explored various avenues to enhance alignment efficiency, primarily through minimal-data training or data-driven activations to identify key attention heads. However, these approaches inherently introduce data dependency, which hinders generalization and reusability. To address this issue and enhance model alignment efficiency, we propose the Attention Localization and Pruning Strategy (ALPS), an efficient algorithm that localizes the most task-sensitive attention heads and prunes by restricting attention training updates to these heads, thereby reducing alignment costs. Experimental results demonstrate that our method activates only 10% of attention parameters during fine-tuning while achieving a 2% performance improvement over baselines on three tasks. Moreover, the identified task-specific heads are transferable across datasets and mitigate knowledge forgetting. Our work and findings provide a novel perspective on efficient LLM alignment.

摘要

将通用大语言模型（LLMs）与下游任务对齐通常需要高昂成本，包括构建任务特定的指令对和大量训练调整。先前研究通过最小数据训练或数据驱动激活来识别关键注意力头，探索了多种提升对齐效率的途径。然而这些方法本质上存在数据依赖性，限制了泛化性和可复用性。为解决该问题并提升模型对齐效率，我们提出注意力定位与剪枝策略（ALPS），该高效算法能定位任务最敏感的注意力头，并通过限制注意力训练更新至这些头部来实施剪枝，从而降低对齐成本。实验结果表明，我们的方法在微调期间仅激活10%的注意力参数，同时在三个任务上实现比基线模型2%的性能提升。此外，所识别的任务特定头部具有跨数据集可迁移性，并能缓解知识遗忘。本工作为高效LLM对齐提供了新视角。

HD-PiSSA: High-Rank Distributed Orthogonal Adaptation

Abstract

arXiv:2505.18777v1 Announce Type: cross Abstract: Existing parameter-efficient fine-tuning (PEFT) methods for large language models (LLMs), such as LoRA and PiSSA, constrain model updates to low-rank subspaces, limiting their expressiveness and leading to suboptimal performance on complex tasks. To address this, we introduce High-rank Distributed PiSSA (HD-PiSSA), a distributed PEFT approach that initializes orthogonal adapters across different devices and aggregates their delta updates collectively on W for fine-tuning. Unlike Data Parallel LoRA or PiSSA, which maintain identical adapters across all devices, HD-PiSSA assigns different principal components of the pre-trained weights to each GPU, significantly expanding the range of update directions. This results in over 16x higher effective updated ranks than data-parallel LoRA or PiSSA when fine-tuning on 8 GPUs with the same per-device adapter rank. Empirically, we evaluate HD-PiSSA across various challenging downstream tasks, including mathematics, code generation, and multi-task learning. In the multi-task setting, HD-PiSSA achieves average gains of 10.0 absolute points (14.63%) over LoRA and 4.98 points (6.60%) over PiSSA across 12 benchmarks, demonstrating its benefits from the extra optimization flexibility.

摘要

现有针对大语言模型（LLM）的参数高效微调方法（如LoRA和PiSSA）将模型更新限制在低秩子空间，制约了其表达能力，导致复杂任务性能欠佳。为此，我们提出高秩分布式PiSSA（HD-PiSSA），该方法通过在不同设备上初始化正交适配器，并聚合其对权重矩阵W的增量更新进行分布式微调。与数据并行的LoRA或PiSSA保持所有设备适配器一致不同，HD-PiSSA为每个GPU分配预训练权重矩阵的不同主成分，从而显著扩展更新方向的范围。当在8个GPU上以相同设备级适配器秩进行微调时，其有效更新秩达到数据并行LoRA或PiSSA的16倍以上。实验评估表明，在数学推理、代码生成和多任务学习等具有挑战性的下游任务中，HD-PiSSA表现优异。多任务场景下，该方法在12个基准测试中平均较LoRA提升10.0个绝对百分点（14.63%），较PiSSA提升4.98个点（6.60%），充分证明了其额外优化灵活性带来的优势。
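
A small Python sketch of why assigning different directions to each device widens the aggregated update. It illustrates only the general linear-algebra point, under toy assumptions: each device contributes a sum of outer products of standard basis vectors, and the names `matrix_rank` and `aggregated_update` are invented here. It does not reproduce the paper's initialization from principal components or its reported 16x figure.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            if factor:
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def aggregated_update(dim, rank_per_device, devices, distinct):
    """Sum the per-device updates into one dim x dim delta matrix.

    Each device contributes `rank_per_device` outer products e_k e_(k+1)^T.
    With distinct=True every device owns its own block of directions;
    with distinct=False all devices reuse the same directions.
    """
    delta = [[0] * dim for _ in range(dim)]
    for device in range(devices):
        start = device * rank_per_device if distinct else 0
        for k in range(start, start + rank_per_device):
            delta[k][(k + 1) % dim] += 1
    return delta


shared = aggregated_update(dim=16, rank_per_device=2, devices=4, distinct=False)
split = aggregated_update(dim=16, rank_per_device=2, devices=4, distinct=True)
```

In this toy setting the shared-adapter update has rank 2 regardless of the number of devices, while the split-adapter update has rank 8, that is, the per-device rank times the device count.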

REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing

Abstract

arXiv:2505.18880v1 Announce Type: cross Abstract: Short videos are an effective tool for promoting content and improving knowledge accessibility. While existing extractive video summarization methods struggle to produce a coherent narrative, existing abstractive methods cannot "quote" from the input videos, i.e., insert short video clips in their outputs. In this work, we explore novel video editing models for generating shorts that feature a coherent narrative with embedded video insertions extracted from a long input video. We propose a novel retrieval-embedded generation framework that allows a large language model to quote multimodal resources while maintaining a coherent narrative. Our proposed REGen system first generates the output story script with quote placeholders using a finetuned large language model, and then uses a novel retrieval model to replace the quote placeholders by selecting a video clip that best supports the narrative from a pool of candidate quotable video clips. We examine the proposed method on the task of documentary teaser generation, where short interview insertions are commonly used to support the narrative of a documentary. Our objective evaluations show that the proposed method can effectively insert short video clips while maintaining a coherent narrative. In a subjective survey, we show that our proposed method outperforms existing abstractive and extractive approaches in terms of coherence, alignment, and realism in teaser generation.

摘要

短视频是推广内容和提升知识可及性的有效工具。现有抽取式视频摘要方法难以生成连贯的叙事，而生成式方法则无法从输入视频中"引用"内容，即在输出中插入短视频片段。本研究探索了一种新型视频编辑模型，用于生成兼具连贯叙事和长视频片段引用的短视频。我们提出了一种创新的检索嵌入生成框架，使大语言模型在保持叙事连贯性的同时能够引用多模态资源。所提出的REGen系统首先通过微调的大语言模型生成带有引用占位符的故事脚本，随后利用新型检索模型从候选视频片段池中选择最能支撑叙事的片段进行替换填充。我们在纪录片预告片生成任务上验证了该方法，该场景中常通过简短采访片段来强化叙事。客观评估表明，该方法能有效插入短视频片段并保持叙事连贯性。主观调研显示，在预告片生成的连贯性、对齐度和真实性方面，本方法优于现有生成式和抽取式方法。
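
The two-stage flow described above (a script with quote placeholders, then retrieval to fill them) can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the `[QUOTE]` marker, the use of clip transcripts as plain text, and bag-of-words cosine similarity standing in for the paper's learned multimodal retrieval model.

```python
import math
import re
from collections import Counter

PLACEHOLDER = "[QUOTE]"  # hypothetical marker; the real placeholder format is not given


def bag_of_words(text):
    """Lowercased word counts for a piece of text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def fill_quotes(script_lines, clip_transcripts):
    """Replace each placeholder with the unused clip closest to the preceding line."""
    filled = []
    remaining = list(clip_transcripts)
    for line in script_lines:
        if line.strip() == PLACEHOLDER and filled and remaining:
            context = bag_of_words(filled[-1])
            best = max(remaining, key=lambda clip: cosine(context, bag_of_words(clip)))
            remaining.remove(best)
            filled.append(f'"{best}"')
        else:
            filled.append(line)
    return filled


script = [
    "The river shaped the city's trade.",
    "[QUOTE]",
    "Later, the factories arrived.",
    "[QUOTE]",
]
clips = [
    "The factories changed everything for us.",
    "We lived by the river and traded every day.",
]
teaser = fill_quotes(script, clips)
```

Each clip is used at most once, so the second placeholder falls back to whatever remains in the pool.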

Writing Like the Best: Exemplar-Based Expository Text Generation

Abstract

arXiv:2505.18859v1 Announce Type: cross Abstract: We introduce the Exemplar-Based Expository Text Generation task, aiming to generate an expository text on a new topic using an exemplar on a similar topic. Current methods fall short due to their reliance on extensive exemplar data, difficulty in adapting topic-specific content, and issues with long-text coherence. To address these challenges, we propose the concept of Adaptive Imitation and present a novel Recurrent Plan-then-Adapt (RePA) framework. RePA leverages large language models (LLMs) for effective adaptive imitation through a fine-grained plan-then-adapt process. RePA also enables recurrent segment-by-segment imitation, supported by two memory structures that enhance input clarity and output coherence. We also develop task-specific evaluation metrics--imitativeness, adaptiveness, and adaptive-imitativeness--using LLMs as evaluators. Experimental results across three diverse datasets that we collected demonstrate that RePA surpasses existing baselines in producing factual, consistent, and relevant texts for this task.

摘要

我们提出"基于范例的说明文生成"任务，旨在利用相似主题的范例生成新主题的说明文。现有方法因依赖大量范例数据、难以适配主题特定内容及长文本连贯性问题而存在局限。为解决这些挑战，我们提出"自适应模仿"概念，并设计新型"循环规划-适配"框架（RePA）。该框架通过细粒度"规划-适配"流程，利用大语言模型实现有效自适应模仿。RePA还支持基于两种记忆结构的逐段循环模仿，从而提升输入清晰度与输出连贯性。我们采用大语言模型作为评估器，开发了任务特异性指标——模仿度、适配度及自适应模仿度。在自建的三个多样化数据集上的实验结果表明，RePA在生成事实准确、内容一致且主题相关的文本方面优于现有基线方法。

PromptWise: Online Learning for Cost-Aware Prompt Assignment in Generative Models

Abstract

arXiv:2505.18901v1 Announce Type: cross Abstract: The rapid advancement of generative AI models has provided users with numerous options to address their prompts. When selecting a generative AI model for a given prompt, users should consider not only the performance of the chosen model but also its associated service cost. The principle guiding such consideration is to select the least expensive model among the available satisfactory options. However, existing model-selection approaches typically prioritize performance, overlooking pricing differences between models. In this paper, we introduce PromptWise, an online learning framework designed to assign a sequence of prompts to a group of large language models (LLMs) in a cost-effective manner. PromptWise strategically queries cheaper models first, progressing to more expensive options only if the lower-cost models fail to adequately address a given prompt. Through numerical experiments, we demonstrate PromptWise's effectiveness across various tasks, including puzzles of varying complexity and code generation/translation tasks. The results highlight that PromptWise consistently outperforms cost-unaware baseline methods, emphasizing that directly assigning prompts to the most expensive models can lead to higher costs and potentially lower average performance.

摘要

生成式AI模型的快速发展为用户提供了多种响应提示的选择。在为给定提示选择生成式AI模型时，用户不仅应考虑所选模型的性能，还需关注其相关服务成本。指导原则是在可用的满意选项中选择成本最低的模型。然而，现有模型选择方法通常优先考虑性能，忽视了模型间的定价差异。本文提出PromptWise——一个在线学习框架，旨在以经济高效的方式将提示序列分配给一组大语言模型(LLM)。该策略首先查询成本较低的模型，仅当低成本模型无法充分响应提示时，才会转向更昂贵的选项。通过数值实验，我们验证了PromptWise在不同任务中的有效性，包括复杂度各异的谜题解决及代码生成/翻译任务。结果表明：PromptWise始终优于不考虑成本的基线方法，这证实直接为提示分配最昂贵模型会导致更高成本，并可能降低平均性能。
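
The escalation rule described above (query cheaper models first, move to costlier ones only on failure) can be sketched as a simple cascade. This covers the escalation step only, not PromptWise's online learning of which models to try; the model dictionaries, the `is_satisfactory` callback, and the toy costs are all hypothetical.

```python
def cheapest_first(prompt, models, is_satisfactory):
    """Query models in ascending cost order and stop at the first acceptable answer.

    Returns (answer, total_cost_spent, name_of_model_that_succeeded_or_None).
    """
    spent = 0
    answer = None
    for model in sorted(models, key=lambda m: m["cost"]):
        answer = model["generate"](prompt)
        spent += model["cost"]  # every query is paid for, including failed ones
        if is_satisfactory(prompt, answer):
            return answer, spent, model["name"]
    return answer, spent, None


# Toy stand-ins: the small model only handles short prompts.
toy_models = [
    {"name": "large", "cost": 10, "generate": lambda p: p.upper()},
    {"name": "small", "cost": 1, "generate": lambda p: p.upper() if len(p) < 5 else ""},
]


def non_empty(prompt, answer):
    """Hypothetical acceptance check: any non-empty answer counts as satisfactory."""
    return answer != ""


easy = cheapest_first("hi", toy_models, non_empty)
hard = cheapest_first("harder", toy_models, non_empty)
```

An easy prompt costs only the small model's price, while a hard prompt pays for both queries, which is why the order of escalation matters for total spend.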

Security Concerns for Large Language Models: A Survey

Abstract

arXiv:2505.18889v1 Announce Type: cross Abstract: Large Language Models (LLMs) such as GPT-4 (and its recent iterations like GPT-4o and the GPT-4.1 series), Google's Gemini, Anthropic's Claude 3 models, and xAI's Grok have caused a revolution in natural language processing, but their capabilities also introduce new security vulnerabilities. In this survey, we provide a comprehensive overview of the emerging security concerns around LLMs, categorizing threats into prompt injection and jailbreaking, adversarial attacks (including input perturbations and data poisoning), misuse by malicious actors (e.g., for disinformation, phishing, and malware generation), and worrisome risks inherent in autonomous LLM agents. A significant focus has been recently placed on the latter, exploring goal misalignment, emergent deception, self-preservation instincts, and the potential for LLMs to develop and pursue covert, misaligned objectives (scheming), which may even persist through safety training. We summarize recent academic and industrial studies (2022-2025) that exemplify each threat, analyze proposed defenses and their limitations, and identify open challenges in securing LLM-based applications. We conclude by emphasizing the importance of advancing robust, multi-layered security strategies to ensure LLMs are safe and beneficial.

摘要

诸如GPT-4（及其近期迭代版本如GPT-4o和GPT-4.1系列）、谷歌Gemini、Anthropic的Claude 3模型以及xAI的Grok等大语言模型（LLMs）引发了自然语言处理领域的革命，但其能力也带来了新的安全漏洞。本综述全面梳理了围绕LLMs新出现的安全问题，将威胁归类为提示注入与越狱、对抗攻击（包括输入扰动和数据投毒）、恶意行为者滥用（如用于虚假信息、钓鱼攻击和恶意软件生成）以及自主LLM智能体固有的高风险隐患。近期研究重点聚焦于最后一项，探讨目标错位、涌现性欺骗、自我保存本能，以及LLMs可能形成并追求隐蔽且错位目标（密谋）的潜在风险——这些行为甚至可能在安全训练后持续存在。我们汇总了2022-2025年间体现各类威胁的学术与产业研究实例，分析现有防御方案及其局限性，并指出保障LLM应用安全面临的开放挑战。最后强调必须发展鲁棒的多层安全策略，以确保LLMs的安全性与有益性。

CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions

Abstract

arXiv:2505.18878v1 Announce Type: cross Abstract: While AI agents hold transformative potential in business, effective performance benchmarking is hindered by the scarcity of public, realistic business data on widely used platforms. Existing benchmarks often lack fidelity in their environments, data, and agent-user interactions, with limited coverage of diverse business scenarios and industries. To address these gaps, we introduce CRMArena-Pro, a novel benchmark for holistic, realistic assessment of LLM agents in diverse professional settings. CRMArena-Pro expands on CRMArena with nineteen expert-validated tasks across sales, service, and 'configure, price, and quote' processes, for both Business-to-Business and Business-to-Customer scenarios. It distinctively incorporates multi-turn interactions guided by diverse personas and robust confidentiality awareness assessments. Experiments reveal leading LLM agents achieve only around 58% single-turn success on CRMArena-Pro, with performance dropping significantly to approximately 35% in multi-turn settings. While Workflow Execution proves more tractable for top agents (over 83% single-turn success), other evaluated business skills present greater challenges. Furthermore, agents exhibit near-zero inherent confidentiality awareness; though targeted prompting can improve this, it often compromises task performance. These findings highlight a substantial gap between current LLM capabilities and enterprise demands, underscoring the need for advancements in multi-turn reasoning, confidentiality adherence, and versatile skill acquisition.

摘要

虽然AI代理在商业领域具有变革潜力，但由于广泛使用平台上公开、真实的商业数据匮乏，其性能基准测试的有效性受到制约。现有基准测试在环境模拟、数据真实性及代理-用户交互方面往往缺乏保真度，且对多样化商业场景和行业的覆盖有限。为弥补这些不足，我们推出CRMArena-Pro——一个用于全面、真实评估大语言模型代理在多元专业场景中表现的新型基准测试体系。该系统在CRMArena基础上扩展，涵盖销售、服务和"配置-定价-报价"流程的19项专家验证任务，同时支持企业级(B2B)和消费者级(B2C)场景。其独特之处在于整合了多角色引导的多轮次交互机制，以及严格的保密意识评估体系。实验数据显示，领先的大语言模型代理在CRMArena-Pro单轮测试中成功率仅约58%，而在多轮交互环境下性能显著下降至35%左右。尽管工作流执行对顶级代理更具可操作性（单轮成功率超83%），其他被测商业技能则表现出更大挑战性。此外，代理表现出近乎零的固有保密意识，虽然针对性提示能改善此问题，但往往以任务性能下降为代价。这些发现揭示了当前大语言模型能力与企业需求间的显著差距，凸显了在多轮推理、保密合规及多技能习得等方面进行技术突破的必要性。

Behavior Injection: Preparing Language Models for Reinforcement Learning

Abstract

arXiv:2505.18917v1 Announce Type: cross Abstract: Reinforcement fine-tuning (RFT) has emerged as a powerful post-training technique to incentivize the reasoning ability of large language models (LLMs). However, LLMs can respond very inconsistently to RFT: some show substantial performance gains, while others plateau or even degrade. To understand this divergence, we analyze the per-step influence of the RL objective and identify two key conditions for effective post-training: (1) RL-informative rollout accuracy, and (2) strong data co-influence, which quantifies how much the training data affects performance on other samples. Guided by these insights, we propose behavior injection, a task-agnostic data-augmentation scheme applied prior to RL. Behavior injection enriches the supervised finetuning (SFT) data by seeding exploratory and exploitative behaviors, effectively making the model more RL-ready. We evaluate our method across two reasoning benchmarks with multiple base models. The results demonstrate that our theoretically motivated augmentation can significantly increase the performance gain from RFT over the pre-RL model.

摘要

强化微调（RFT）已成为一种强大的训练后技术，用于增强大语言模型（LLMs）的推理能力。然而，LLMs对RFT的反应可能非常不一致：部分模型表现出显著的性能提升，而其他模型则停滞不前甚至性能下降。为理解这种差异，我们分析了RL目标在每一步的影响，并确定了有效训练后的两个关键条件：（1）RL信息化的推演准确率，以及（2）强数据共影响力（用于量化训练数据对其他样本性能的影响程度）。基于这些发现，我们提出了行为注入——一种在RL之前应用的与任务无关的数据增强方案。该方案通过植入探索性和利用性行为来丰富监督微调（SFT）数据，从而有效提升模型的RL适应性。我们在两个推理基准测试中采用多种基础模型评估了该方法。结果表明，这种理论驱动的数据增强能显著提高RFT相对于RL前模型的性能增益。

The Price of Format: Diversity Collapse in LLMs

Abstract

arXiv:2505.18949v1 Announce Type: cross Abstract: Instruction-tuned large language models (LLMs) employ structured templates, such as role markers and special tokens, to enforce format consistency during inference. However, we identify a critical limitation of such formatting: it induces a phenomenon we term diversity collapse, where the model generates semantically similar outputs for open-ended inputs, undermining creativity and variability. We systematically evaluate this effect across tasks like story completion and free-form generation, finding that (1) diversity collapse persists even under high-temperature sampling, and (2) structural tokens in templates significantly constrain the model's output space. To contextualize these findings, we fine-tune the same model using a range of structured prompts and then evaluate them across three axes: downstream task performance, alignment behavior, and output diversity. Our analysis shows that format consistency between fine-tuning and inference is crucial for structure-sensitive tasks (e.g., GSM8K, IFEval), but has marginal influence on knowledge-heavy tasks (e.g., MMLU, WebQuestions). In contrast, output diversity is primarily governed by the presence or absence of structural tokens, with minimal formatting yielding the most diverse outputs. These findings reveal that current prompting conventions, while beneficial for alignment, may inadvertently suppress output diversity, underscoring the need for diversity-aware prompt design and instruction tuning.

摘要

指令调优的大型语言模型（LLMs）采用结构化模板（如角色标记和特殊符号）以在推理过程中保持格式一致性。然而，我们发现这种格式化存在一个关键缺陷：它会导致我们称之为"多样性坍缩"的现象，即模型针对开放式输入生成语义相似的输出，从而削弱了创造性和变异性。我们通过故事补全和自由生成等任务系统评估了这一效应，发现：（1）即使在高温度采样下，多样性坍缩依然存在；（2）模板中的结构符号显著限制了模型的输出空间。为量化这些发现，我们使用一系列结构化提示对同一模型进行微调，并从三个维度进行评估：下游任务表现、对齐行为和输出多样性。分析表明，微调与推理间的格式一致性对结构敏感型任务（如GSM8K、IFEval）至关重要，但对知识密集型任务（如MMLU、WebQuestions）影响有限。相比之下，输出多样性主要受结构符号存在与否的调控，最小化格式化能产生最多样化的输出。这些发现揭示，当前提示规范虽有利于对齐，却可能无意中抑制输出多样性，这凸显了设计多样性感知的提示模板和指令调优的必要性。

Benchmarking Large Language Models for Cyberbullying Detection in Real-World YouTube Comments

Abstract

arXiv:2505.18927v1 Announce Type: cross Abstract: As online platforms grow, comment sections increasingly host harassment that undermines user experience and well-being. This study benchmarks three leading large language models, OpenAI GPT-4.1, Google Gemini 1.5 Pro, and Anthropic Claude 3 Opus, on a corpus of 5,080 YouTube comments sampled from high-abuse threads in gaming, lifestyle, food vlog, and music channels. The dataset comprises 1,334 harmful and 3,746 non-harmful messages in English, Arabic, and Indonesian, annotated independently by two reviewers with substantial agreement (Cohen's kappa = 0.83). Using a unified prompt and deterministic settings, GPT-4.1 achieved the best overall balance with an F1 score of 0.863, precision of 0.887, and recall of 0.841. Gemini flagged the highest share of harmful posts (recall = 0.875) but its precision fell to 0.767 due to frequent false positives. Claude delivered the highest precision at 0.920 and the lowest false-positive rate of 0.022, yet its recall dropped to 0.720. Qualitative analysis showed that all three models struggle with sarcasm, coded insults, and mixed-language slang. These results underscore the need for moderation pipelines that combine complementary models, incorporate conversational context, and fine-tune for under-represented languages and implicit abuse. A de-identified version of the dataset and full prompts is publicly released to promote reproducibility and further progress in automated content moderation.

摘要

随着网络平台的发展，评论区日益增多的骚扰行为损害了用户体验和心理健康。本研究以游戏、生活方式、美食视频博客和音乐频道中高攻击性讨论区的5,080条YouTube评论为样本，对OpenAI GPT-4.1、Google Gemini 1.5 Pro和Anthropic Claude 3 Opus三大主流大语言模型进行基准测试。该数据集包含1,334条有害信息和3,746条无害信息，涵盖英语、阿拉伯语和印尼语，经两位评审员独立标注且具有高度一致性（Cohen's kappa = 0.83）。采用统一提示词和确定性参数设置时，GPT-4.1以0.863的F1值、0.887的精确率和0.841的召回率取得最佳综合平衡。Gemini标记有害内容的比例最高（召回率=0.875），但因频繁误报导致精确率降至0.767。Claude以0.920的精确率和0.022的最低误报率表现最优，但其召回率下降至0.720。定性分析表明，三种模型均难以识别讽刺、隐晦侮辱和混合语言俚语。这些结果凸显了建立内容审核管道的必要性：需整合互补模型、结合对话语境，并针对低资源语言和隐性攻击进行优化。本研究公开了匿名化数据集和完整提示词，以促进自动化内容审核研究的可重复性和进一步发展。
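
As a quick consistency check on the reported metrics, the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The Python sketch below applies that formula to the precision and recall values quoted in the abstract. Only the GPT-4.1 F1 (0.863) is stated in the abstract; the other two F1 values are derived here, not reported.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as quoted in the abstract.
reported = {
    "GPT-4.1": (0.887, 0.841),
    "Gemini 1.5 Pro": (0.767, 0.875),
    "Claude 3 Opus": (0.920, 0.720),
}

derived_f1 = {name: f1_score(p, r) for name, (p, r) in reported.items()}

for name, value in derived_f1.items():
    print(f"{name}: F1 = {value:.3f}")
```

The GPT-4.1 value rounds to the reported 0.863, and the derived figures for the other two models come out lower, which is consistent with the abstract's claim that GPT-4.1 strikes the best overall balance.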

FiLLM -- A Filipino-optimized Large Language Model based on Southeast Asia Large Language Model (SEALLM)

Abstract

arXiv:2505.18995v1 Announce Type: cross Abstract: This study presents FiLLM, a Filipino-optimized large language model, designed to enhance natural language processing (NLP) capabilities in the Filipino language. Built upon the SeaLLM-7B 2.5 model, FiLLM leverages Low-Rank Adaptation (LoRA) fine-tuning to optimize memory efficiency while maintaining task-specific performance. The model was trained and evaluated on diverse Filipino datasets to address key NLP tasks, including Named Entity Recognition (NER), Part-of-Speech (POS) tagging, Dependency Parsing, and Text Summarization. Performance comparisons with the CalamanCy model were conducted using F1 Score, Precision, Recall, Compression Rate, and Keyword Overlap metrics. Results indicate that CalamanCy outperforms FiLLM in several aspects, demonstrating its effectiveness in processing Filipino text with improved linguistic comprehension and adaptability. This research contributes to the advancement of Filipino NLP applications by providing an optimized, efficient, and scalable language model tailored for local linguistic needs.

摘要

本研究提出FiLLM——一个针对菲律宾语优化的大语言模型，旨在提升菲律宾语自然语言处理（NLP）能力。该模型基于SeaLLM-7B 2.5架构，采用低秩自适应（LoRA）微调技术，在保持任务特定性能的同时优化内存效率。研究通过多样化的菲律宾语数据集对模型进行训练与评估，涵盖命名实体识别（NER）、词性标注（POS）、依存句法分析和文本摘要等核心NLP任务。与CalamanCy模型的性能对比采用F1值、精确率、召回率、压缩率和关键词重叠度等指标。结果表明，CalamanCy在多项指标上优于FiLLM，展现出其在菲律宾语文本处理中更强的语言理解能力与适应性。本研究通过开发针对本土语言需求定制的高效、可扩展优化模型，为菲律宾语NLP应用的发展做出贡献。

An Initial Exploration of Fine-tuning Small Language Models for Smart Contract Reentrancy Vulnerability Detection

Abstract

arXiv:2505.19059v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly used for coding tasks, including helping coders identify bugs, and are a promising avenue for supporting tasks such as vulnerability detection, particularly given the flexibility of such generative AI models and tools. Yet LLMs may not be suitable for many tasks; smaller language models that fit on a developer's computer, and can be easily executed and trained there, may be a better choice. In this paper we explore and evaluate whether smaller language models can be fine-tuned to achieve reasonable results in a niche area, vulnerability detection, focusing specifically on detecting the reentrancy bug in Solidity smart contracts.

摘要

大型语言模型（LLMs）正越来越多地应用于各类编码任务，包括帮助程序员识别代码缺陷，并因其生成式人工智能模型与工具的高度灵活性，成为支持漏洞检测等多样化任务的重要途径。然而对于许多场景而言，使用LLMs可能并不适宜，此时更适合采用能在开发者计算机上轻松部署、训练和执行的较小规模语言模型。本文针对特定领域——智能合约漏洞检测（尤其侧重于Solidity合约中的重入漏洞识别），系统探究并评估了通过微调小型语言模型能否获得有效检测效果。

InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts

Abstract

arXiv:2505.19028v1 Announce Type: cross Abstract: Understanding infographic charts with design-driven visual elements (e.g., pictograms, icons) requires both visual recognition and reasoning, posing challenges for multimodal large language models (MLLMs). However, existing visual-question answering benchmarks fall short in evaluating these capabilities of MLLMs due to the lack of paired plain charts and visual-element-based questions. To bridge this gap, we introduce InfoChartQA, a benchmark for evaluating MLLMs on infographic chart understanding. It includes 5,642 pairs of infographic and plain charts, each sharing the same underlying data but differing in visual presentations. We further design visual-element-based questions to capture their unique visual designs and communicative intent. Evaluation of 20 MLLMs reveals a substantial performance decline on infographic charts, particularly for visual-element-based questions related to metaphors. The paired infographic and plain charts enable fine-grained error analysis and ablation studies, which highlight new opportunities for advancing MLLMs in infographic chart understanding. We release InfoChartQA at https://github.com/CoolDawnAnt/InfoChartQA.

摘要

理解包含设计驱动视觉元素（如图标、符号）的信息图表需要视觉识别与推理能力，这对多模态大语言模型（MLLMs）提出了挑战。然而，由于缺乏配对的普通图表和基于视觉元素的问题，现有视觉问答基准难以有效评估MLLMs的上述能力。为填补这一空白，我们提出InfoChartQA基准，用于评估MLLMs在信息图表理解上的表现。该基准包含5,642对信息图表与普通图表，每对共享相同底层数据但呈现形式不同。我们进一步设计了基于视觉元素的问题，以捕捉其独特的视觉设计及传达意图。对20个MLLMs的评估表明，模型在信息图表上的性能显著下降，尤其表现在涉及隐喻的视觉元素问题上。配对的图表设计支持细粒度错误分析与消融研究，揭示了提升MLLMs信息图表理解能力的新机遇。项目已发布于https://github.com/CoolDawnAnt/InfoChartQA。

An Embarrassingly Simple Defense Against LLM Abliteration Attacks

Abstract

arXiv:2505.19056v1 Announce Type: cross Abstract: Large language models (LLMs) are typically aligned to comply with safety guidelines by refusing harmful instructions. A recent attack, termed abliteration, isolates and suppresses the single latent direction most responsible for refusal behavior, enabling the model to generate unethical content. We propose a defense that modifies how models generate refusals. We construct an extended-refusal dataset that contains harmful prompts with a full response that justifies the reason for refusal. We then fine-tune Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on our extended-refusal dataset, and evaluate the resulting systems on a set of harmful prompts. In our experiments, extended-refusal models maintain high refusal rates, dropping at most by 10%, whereas baseline models' refusal rates drop by 70-80% after abliteration. A broad evaluation of safety and utility shows that extended-refusal fine-tuning neutralizes the abliteration attack while preserving general performance.

摘要

大型语言模型（LLMs）通常通过拒绝有害指令来遵循安全准则。最近出现了一种名为"消融攻击"（abliteration）的攻击方法，该方法通过隔离并抑制导致拒绝行为的最关键潜在方向，使模型能够生成不道德内容。我们提出了一种改进模型拒绝生成机制的防御方案。首先构建了一个扩展拒绝数据集，其中包含有害提示及完整阐述拒绝理由的回应文本。随后基于该数据集对Llama-2-7B-Chat和Qwen2.5-Instruct（15亿和30亿参数）模型进行微调，并在有害提示集上评估改进后的系统。实验表明，扩展拒绝模型仍能保持较高的拒绝率，降幅至多为10%，而基线模型在消融攻击后拒绝率下降70-80%。综合安全性与实用性的评估显示，扩展拒绝微调既能有效抵御消融攻击，又保持了模型的整体性能。
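
Suppressing a single latent direction is commonly done by projecting it out of a hidden state: h' = h - (h . r) r for a unit vector r. The Python sketch below shows that projection on plain lists. It assumes this projection form, which the abstract does not spell out, and it takes the direction as a given input, since how the refusal direction is found is not described there. The function names are illustrative.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ablate_direction(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`."""
    unit = normalize(direction)
    coefficient = dot(hidden, unit)
    return [h - coefficient * u for h, u in zip(hidden, unit)]


hidden_state = [3.0, 4.0, 0.0]
refusal_direction = [0.0, 2.0, 0.0]  # hypothetical direction for illustration
ablated = ablate_direction(hidden_state, refusal_direction)
```

After the projection, the hidden state has no component along the chosen direction, which is why a refusal signal concentrated in one direction is easy to erase and one spread across a longer justified response is harder to remove.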

Abstract

arXiv:2505.19108v1 Announce Type: cross Abstract: Investigating hallucination issues in large language models (LLMs) within cross-lingual and cross-modal scenarios can greatly advance the large-scale deployment in real-world applications. Nevertheless, the current studies are limited to a single scenario, either cross-lingual or cross-modal, leaving a gap in the exploration of hallucinations in the joint cross-lingual and cross-modal scenarios. Motivated by this, we introduce a novel joint Cross-lingual and Cross-modal Hallucinations benchmark (CCHall) to fill this gap. Specifically, CCHall simultaneously incorporates both cross-lingual and cross-modal hallucination scenarios, which can be used to assess the cross-lingual and cross-modal capabilities of LLMs. Furthermore, we conduct a comprehensive evaluation on CCHall, exploring both mainstream open-source and closed-source LLMs. The experimental results highlight that current LLMs still struggle with CCHall. We hope CCHall can serve as a valuable resource to assess LLMs in joint cross-lingual and cross-modal scenarios.

摘要

研究大语言模型（LLMs）在跨语言与跨模态场景中的幻觉问题，对推动其在实际应用中的大规模部署具有重要意义。然而，现有研究仅局限于单一场景（跨语言或跨模态），尚未探索跨语言与跨模态联合场景下的幻觉现象。为此，我们提出了首个跨语言与跨模态联合幻觉基准（CCHall）以填补这一空白。具体而言，CCHall同时涵盖跨语言和跨模态幻觉场景，可用于评估LLMs的跨语言与跨模态能力。此外，我们对主流开源与闭源LLMs进行了全面评测，实验结果表明当前LLMs在CCHall上仍面临显著挑战。我们希望CCHall能成为评估跨语言与跨模态联合场景下LLMs性能的重要资源。

Medical Large Vision Language Models with Multi-Image Visual Ability

Abstract

arXiv:2505.19031v1 Announce Type: cross Abstract: Medical large vision-language models (LVLMs) have demonstrated promising performance across various single-image question answering (QA) benchmarks, yet their capability in processing multi-image clinical scenarios remains underexplored. Unlike single image based tasks, medical tasks involving multiple images often demand sophisticated visual understanding capabilities, such as temporal reasoning and cross-modal analysis, which are poorly supported by current medical LVLMs. To bridge this critical gap, we present the Med-MIM instruction dataset, comprising 83.2K medical multi-image QA pairs that span four types of multi-image visual abilities (temporal understanding, reasoning, comparison, co-reference). Using this dataset, we fine-tune Mantis and LLaVA-Med, resulting in two specialized medical VLMs: MIM-LLaVA-Med and Med-Mantis, both optimized for multi-image analysis. Additionally, we develop the Med-MIM benchmark to comprehensively evaluate the medical multi-image understanding capabilities of LVLMs. We assess eight popular LVLMs, including our two models, on the Med-MIM benchmark. Experimental results show that both Med-Mantis and MIM-LLaVA-Med achieve superior performance on the held-in and held-out subsets of the Med-MIM benchmark, demonstrating that the Med-MIM instruction dataset effectively enhances LVLMs' multi-image understanding capabilities in the medical domain.

摘要

医学大型视觉语言模型（LVLM）在各种单图像问答（QA）基准测试中展现出优异性能，但其处理多图像临床场景的能力仍待探索。与基于单图像的任务不同，涉及多图像的医学任务通常需要复杂的视觉理解能力（如时序推理和跨模态分析），而当前医学LVLM对此类能力的支持严重不足。为填补这一关键空白，我们提出了Med-MIM指令数据集，包含83.2K个涵盖四种多图像视觉能力（时序理解、推理、比较、共指）的医学多图像问答对。基于该数据集，我们对Mantis和LLaVA-Med进行微调，得到两个专精于多图像分析的医学视觉语言模型：MIM-LLaVA-Med和Med-Mantis。此外，我们开发了Med-MIM基准测试，用于全面评估LVLM的医学多图像理解能力。我们对包括两个新模型在内的八种主流LVLM进行了测试，实验结果表明：Med-Mantis和MIM-LLaVA-Med在Med-MIM基准测试的保留集和外部集上均表现卓越，证实Med-MIM指令数据集能有效提升LVLM在医学领域的多图像理解能力。

FP4 All the Way: Fully Quantized Training of LLMs

Abstract

arXiv:2505.19115v1 Announce Type: cross Abstract: We demonstrate, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients on datasets up to 200 billion tokens. We extensively investigate key design choices for FP4, including block sizes, scaling formats, and rounding methods. Our analysis shows that the NVFP4 format, where each block of 16 FP4 values (E2M1) shares a scale represented in E4M3, provides optimal results. We use stochastic rounding for backward and update passes and round-to-nearest for the forward pass to enhance stability. Additionally, we identify a theoretical and empirical threshold for effective quantized training: when the gradient norm falls below approximately $\sqrt{3}$ times the quantization noise, quantized training becomes less effective. Leveraging these insights, we successfully train a 7-billion-parameter model on 256 Intel Gaudi2 accelerators. The resulting FP4-trained model achieves downstream task performance comparable to a standard BF16 baseline, confirming that FP4 training is a practical and highly efficient approach for large-scale LLM training. A reference implementation is supplied in https://github.com/Anonymous1252022/fp4-all-the-way.

摘要

我们首次展示了在多达2000亿标记的数据集上，主要使用4位浮点（FP4）精度对大型语言模型（LLM）进行全量化训练（FQT），涵盖权重、激活值和梯度。我们深入研究了FP4的关键设计选择，包括块大小、缩放格式和舍入方法。分析表明，采用NVFP4格式（即每16个FP4值（E2M1）共享一个E4M3表示的缩放因子）可获得最佳结果。在反向传播和参数更新阶段使用随机舍入，前向传播阶段采用就近舍入以增强稳定性。此外，我们发现量化训练有效性的理论及实证阈值：当梯度范数低于量化噪声约 $\sqrt{3}$ 倍时，量化训练效果会下降。基于这些发现，我们在256个英特尔Gaudi2加速器上成功训练了一个70亿参数模型。该FP4训练模型在下游任务中达到与标准BF16基线相当的性能，证实FP4训练是大规模LLM训练中实用且高效的方法。参考实现详见https://github.com/Anonymous1252022/fp4-all-the-way。
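
A minimal Python sketch of block-scaled FP4 quantization as described above, assuming the standard E2M1 magnitude grid (0, 0.5, 1, 1.5, 2, 3, 4, 6). It quantizes one block of 16 values with a shared scale and offers both round-to-nearest and stochastic rounding. Two simplifications are assumptions of this sketch: the shared scale is kept as an ordinary Python float instead of being encoded in E4M3, and ties in round-to-nearest go to the smaller magnitude. The function names are illustrative and this is not the reference implementation.

```python
import random

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # non-negative E2M1 magnitudes
BLOCK_SIZE = 16


def round_nearest(magnitude):
    """Snap a magnitude in [0, 6] to the closest grid point."""
    return min(E2M1_GRID, key=lambda g: abs(g - magnitude))


def round_stochastic(magnitude, rng):
    """Round up or down at random so that the expected value equals the input."""
    lower = max(g for g in E2M1_GRID if g <= magnitude)
    upper = min(g for g in E2M1_GRID if g >= magnitude)
    if upper == lower:
        return lower
    p_up = (magnitude - lower) / (upper - lower)
    return upper if rng.random() < p_up else lower


def quantize_block(values, stochastic=False, rng=None):
    """Quantize one block and return (dequantized_values, shared_scale)."""
    if len(values) != BLOCK_SIZE:
        raise ValueError("expected a block of 16 values")
    rng = rng or random.Random(0)
    max_abs = max(abs(v) for v in values)
    scale = max_abs / E2M1_GRID[-1] if max_abs > 0 else 1.0
    dequantized = []
    for v in values:
        magnitude = min(abs(v) / scale, E2M1_GRID[-1])  # clamp float overshoot
        if stochastic:
            q = round_stochastic(magnitude, rng)
        else:
            q = round_nearest(magnitude)
        dequantized.append((q if v >= 0 else -q) * scale)
    return dequantized, scale


block = [6.0, 0.7, -2.4, 5.1] + [0.0] * 12
nearest, scale = quantize_block(block)
```

Round-to-nearest gives a deterministic result, whereas stochastic rounding trades per-value accuracy for an unbiased average, which matches the abstract's choice of stochastic rounding for the backward and update passes.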

RetrieveAll: A Multilingual Named Entity Recognition Framework with Large Language Models

Abstract

arXiv:2505.19128v1 Announce Type: cross Abstract: The rise of large language models has led to significant performance breakthroughs in named entity recognition (NER) for high-resource languages, yet there remains substantial room for improvement in low- and medium-resource languages. Existing multilingual NER methods face severe language interference during the multi-language adaptation process, manifested in feature conflicts between different languages and the competitive suppression of low-resource language features by high-resource languages. Although training a dedicated model for each language can mitigate such interference, it lacks scalability and incurs excessive computational costs in real-world applications. To address this issue, we propose RetrieveAll, a universal multilingual NER framework based on dynamic LoRA. The framework decouples task-specific features across languages and demonstrates efficient dynamic adaptability. Furthermore, we introduce a cross-granularity knowledge augmented method that fully exploits the intrinsic potential of the data without relying on external resources. By leveraging a hierarchical prompting mechanism to guide knowledge injection, this approach advances the paradigm from "prompt-guided inference" to "prompt-driven learning." Experimental results show that RetrieveAll outperforms existing baselines; on the PAN-X dataset, it achieves an average F1 improvement of 12.1 percent.

摘要

大型语言模型的兴起使得高资源语言在命名实体识别（NER）任务上取得显著性能突破，但中低资源语言仍有较大提升空间。现有多语言NER方法在多语言适配过程中面临严重的语言干扰问题，表现为不同语言间的特征冲突以及高资源语言对低资源语言特征的竞争性抑制。尽管为每种语言训练专用模型可缓解此类干扰，但该方法缺乏可扩展性，且在实际应用中会产生过高计算成本。为解决这一问题，我们提出RetrieveAll——一个基于动态LoRA的通用多语言NER框架。该框架实现了跨语言任务特征的解耦，并展现出高效的动态适应能力。此外，我们提出一种跨粒度知识增强方法，在不依赖外部资源的情况下充分挖掘数据内在潜力。通过采用分层提示机制引导知识注入，该方法将范式从"提示引导推理"推进至"提示驱动学习"。实验结果表明，RetrieveAll优于现有基线模型；在PAN-X数据集上平均F1值提升达12.1%。

SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs

Abstract

arXiv:2505.19163v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various disciplines and tasks. However, benchmarking their capabilities with multilingual spoken queries remains largely unexplored. In this study, we introduce SpokenNativQA, the first multilingual and culturally aligned spoken question-answering (SQA) dataset designed to evaluate LLMs in real-world conversational settings. The dataset comprises approximately 33,000 naturally spoken questions and answers in multiple languages, including low-resource and dialect-rich languages, providing a robust benchmark for assessing LLM performance in speech-based interactions. SpokenNativQA addresses the limitations of text-based QA datasets by incorporating speech variability, accents, and linguistic diversity. We benchmark different ASR systems and LLMs for SQA and present our findings. We released the data at (https://huggingface.co/datasets/QCRI/SpokenNativQA) and the experimental scripts at (https://llmebench.qcri.org/) for the research community.

摘要

大语言模型（LLMs）已在多学科和任务中展现出卓越性能。然而，针对多语言口语查询的能力基准测试仍存在较大研究空白。本研究推出SpokenNativQA——首个为评估现实对话场景中LLMs表现而设计的、具备多语言与文化对齐特性的口语问答（SQA）数据集。该数据集包含约33,000条自然口语形式的多语言问答对，涵盖资源稀缺和方言丰富的语种，为语音交互场景下的LLM性能评估提供了可靠基准。SpokenNativQA通过纳入语音变异、口音及语言多样性，弥补了文本问答数据集的局限性。我们对不同自动语音识别系统及LLMs进行了SQA基准测试并呈现结果。相关数据（https://huggingface.co/datasets/QCRI/SpokenNativQA）与实验脚本（https://llmebench.qcri.org/）已向研究社区开源。

Shifting AI Efficiency From Model-Centric to Data-Centric Compression

Abstract

arXiv:2505.19147v1 Announce Type: cross Abstract: The rapid advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on model-centric scaling through increasing parameter counts from millions to hundreds of billions to drive performance gains. However, as we approach hardware limits on model size, the dominant computational bottleneck has fundamentally shifted to the quadratic cost of self-attention over long token sequences, now driven by ultra-long text contexts, high-resolution images, and extended videos. In this position paper, we argue that the focus of research for efficient AI is shifting from model-centric compression to data-centric compression. We position token compression as the new frontier, which improves AI efficiency via reducing the number of tokens during model training or inference. Through comprehensive analysis, we first examine recent developments in long-context AI across various domains and establish a unified mathematical framework for existing model efficiency strategies, demonstrating why token compression represents a crucial paradigm shift in addressing long-context overhead. Subsequently, we systematically review the research landscape of token compression, analyzing its fundamental benefits and identifying its compelling advantages across diverse scenarios. Furthermore, we provide an in-depth analysis of current challenges in token compression research and outline promising future directions. Ultimately, our work aims to offer a fresh perspective on AI efficiency, synthesize existing research, and catalyze innovative developments to address the challenges that increasing context lengths pose to the AI community's advancement.

摘要

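大语言模型（LLMs）和多模态大语言模型（MLLMs）的快速发展，历来依赖以模型为中心的规模扩展，即通过将参数量从数百万增加到数千亿来推动性能提升。然而，随着模型规模逼近硬件极限，主要的计算瓶颈已从根本上转向长令牌序列上自注意力的二次方开销，而这一开销如今由超长文本上下文、高分辨率图像和长视频所驱动。在这篇立场论文中，我们主张高效人工智能的研究重心正从模型中心化压缩转向数据中心化压缩。我们将令牌压缩确立为新前沿领域，其通过减少模型训练或推理时的令牌数量来提升AI效率。通过全面分析，首先考察了跨领域长上下文AI的最新进展，建立了现有模型效率策略的统一数学框架，论证了为何令牌压缩是解决长上下文开销的关键范式转变。随后系统梳理了令牌压缩的研究格局，分析其基础优势并揭示其在多元场景中的显著价值。进一步深入探讨了当前令牌压缩研究的核心挑战，并展望了未来发展方向。本研究旨在为AI效率提供新视角，整合现有成果，并推动创新突破以应对日益增长的上下文长度对AI领域发展提出的挑战。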
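
Because self-attention compares every token with every other token, its cost grows with the square of the sequence length, so removing tokens pays off quadratically. The Python sketch below makes that arithmetic concrete. It counts token pairs only and ignores everything else in a model's compute budget; the function names and the `keep_ratio` parameter are illustrative assumptions.

```python
def attention_pair_count(num_tokens):
    """Number of query-key pairs scored by full self-attention."""
    return num_tokens * num_tokens


def attention_cost_ratio(num_tokens, keep_ratio):
    """Attention cost after token compression, relative to the uncompressed cost."""
    kept = int(num_tokens * keep_ratio)
    return attention_pair_count(kept) / attention_pair_count(num_tokens)


half = attention_cost_ratio(1000, 0.5)
quarter = attention_cost_ratio(4096, 0.25)
```

Keeping half of the tokens leaves a quarter of the attention cost, and keeping a quarter leaves one sixteenth.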

OptiMindTune: A Multi-Agent Framework for Intelligent Hyperparameter Optimization

Abstract

arXiv:2505.19205v1 Announce Type: cross Abstract: Hyperparameter optimization (HPO) is a critical yet challenging aspect of machine learning model development, significantly impacting model performance and generalization. Traditional HPO methods often struggle with high dimensionality, complex interdependencies, and computational expense. This paper introduces OptiMindTune, a novel multi-agent framework designed to intelligently and efficiently optimize hyperparameters. OptiMindTune leverages the collaborative intelligence of three specialized AI agents -- a Recommender Agent, an Evaluator Agent, and a Decision Agent -- each powered by Google's Gemini models. These agents address distinct facets of the HPO problem, from model selection and hyperparameter suggestion to robust evaluation and strategic decision-making. By fostering dynamic interactions and knowledge sharing, OptiMindTune aims to converge to optimal hyperparameter configurations more rapidly and robustly than existing single-agent or monolithic approaches. Our framework integrates principles from advanced large language models, and adaptive search to achieve scalable and intelligent AutoML. We posit that this multi-agent paradigm offers a promising avenue for tackling the increasing complexity of modern machine learning model tuning.

摘要

超参数优化（HPO）是机器学习模型开发中关键但具有挑战性的环节，对模型性能与泛化能力影响显著。传统HPO方法常受限于高维度、复杂参数关联及高昂计算成本。本文提出OptiMindTune——一种新型多智能体框架，旨在智能高效地优化超参数。该框架利用三个由Google Gemini模型驱动的专业AI智能体（推荐智能体、评估智能体与决策智能体）的协同智能，分别处理HPO问题的不同层面，包括模型选择、超参数建议、鲁棒性评估及策略决策。通过促进动态交互与知识共享，OptiMindTune相比现有单智能体或整体式方法能以更快速度、更强鲁棒性收敛至最优超参数配置。本框架融合了先进大语言模型与自适应搜索原理，实现可扩展的智能自动化机器学习。我们认为这种多智能体范式为解决现代机器学习模型调参日益增长的复杂性提供了可行路径。

POQD: Performance-Oriented Query Decomposer for Multi-vector retrieval

Abstract

arXiv:2505.19189v1 Announce Type: cross Abstract: Although Multi-Vector Retrieval (MVR) has achieved the state of the art on many information retrieval (IR) tasks, its performance highly depends on how to decompose queries into smaller pieces, say phrases or tokens. However, optimizing query decomposition for MVR performance is not end-to-end differentiable. Even worse, jointly solving this problem and training the downstream retrieval-based systems, say RAG systems could be highly inefficient. To overcome these challenges, we propose Performance-Oriented Query Decomposer (POQD), a novel query decomposition framework for MVR. POQD leverages one LLM for query decomposition and searches the optimal prompt with an LLM-based optimizer. We further propose an end-to-end training algorithm to alternatively optimize the prompt for query decomposition and the downstream models. This algorithm can achieve superior MVR performance at a reasonable training cost as our theoretical analysis suggests. POQD can be integrated seamlessly into arbitrary retrieval-based systems such as Retrieval-Augmented Generation (RAG) systems. Extensive empirical studies on representative RAG-based QA tasks show that POQD outperforms existing query decomposition strategies in both retrieval performance and end-to-end QA accuracy. POQD is available at https://github.com/PKU-SDS-lab/POQD-ICML25.

摘要

尽管多向量检索（MVR）在许多信息检索（IR）任务中达到了最先进的性能，但其效果高度依赖于如何将查询分解为更小的片段（如短语或词元）。然而，为优化MVR性能而进行的查询分解并非端到端可微分。更严重的是，将该问题与下游基于检索的系统（如RAG系统）联合训练时效率极低。为克服这些挑战，我们提出面向性能的查询分解器（POQD）——一种新型的MVR查询分解框架。POQD利用一个大语言模型（LLM）进行查询分解，并通过基于LLM的优化器搜索最优提示。我们进一步提出一种端到端训练算法，交替优化查询分解提示与下游模型。理论分析表明，该算法能以合理训练成本实现卓越的MVR性能。POQD可无缝集成至任意基于检索的系统（如检索增强生成系统）。在典型RAG问答任务上的大量实验表明，POQD在检索性能和端到端问答准确率上均优于现有查询分解策略。POQD代码已开源：https://github.com/PKU-SDS-lab/POQD-ICML25。

To CoT or To Loop? A Formal Comparison Between Chain-of-Thought and Looped Transformers

Abstract

arXiv:2505.19245v1 Announce Type: cross Abstract: Chain-of-Thought (CoT) and Looped Transformers have been shown to empirically improve performance on reasoning tasks and to theoretically enhance expressivity by recursively increasing the number of computational steps. However, their comparative capabilities are still not well understood. In this paper, we provide a formal analysis of their respective strengths and limitations. We show that Looped Transformers can efficiently simulate parallel computations for deterministic tasks, which we formalize as evaluation over directed acyclic graphs. In contrast, CoT with stochastic decoding excels at approximate inference for compositional structures, namely self-reducible problems. These separations suggest the tasks for which depth-driven recursion is more suitable, thereby offering practical cues for choosing between reasoning paradigms.

摘要

思维链（CoT）与循环变压器已被实证证明能提升推理任务表现，并在理论上通过递归增加计算步数来增强表达能力。然而，二者的相对能力仍未被充分理解。本文对其各自优势与局限进行了形式化分析：循环变压器可高效模拟确定性任务（形式化为有向无环图求值）的并行计算，而采用随机解码的CoT则擅长组合结构（即自可归约问题）的近似推理。这些差异揭示了深度驱动递归更适用的任务类型，从而为推理范式的选择提供了实践依据。

Abstract

arXiv:2505.19187v1 Announce Type: cross Abstract: Large language models (LLMs) have demonstrated remarkable reasoning capabilities through test-time scaling approaches, particularly when fine-tuned with chain-of-thought (CoT) data distilled from more powerful large reasoning models (LRMs). However, these reasoning chains often contain verbose elements that mirror human problem-solving, categorized as progressive reasoning (the essential solution development path) and functional elements (verification processes, alternative solution approaches, and error corrections). While progressive reasoning is crucial, the functional elements significantly increase computational demands during test-time inference. We introduce PIR (Perplexity-based Importance Refinement), a principled framework that quantitatively evaluates the importance of each reasoning step based on its impact on answer prediction confidence. PIR systematically identifies and selectively prunes only low-importance functional steps while preserving progressive reasoning components, creating optimized training data that maintains the integrity of the core solution path while reducing verbosity. Models fine-tuned on PIR-optimized data exhibit superior test-time scaling properties, generating more concise reasoning chains while achieving improved accuracy (+0.9% to +6.6%) with significantly reduced token usage (-3% to -41%) across challenging reasoning benchmarks (AIME, AMC, and GPQA Diamond). Our approach demonstrates strong generalizability across different model sizes, data sources, and token budgets, offering a practical solution for deploying reasoning-capable LLMs in scenarios where efficient test-time scaling, response time, and computational efficiency are valuable constraints.

摘要

大型语言模型（LLMs）通过测试时扩展方法展现出卓越的推理能力，尤其是在使用从更强大的大型推理模型（LRMs）中提炼的思维链（CoT）数据进行微调时。然而，这些推理链通常包含反映人类问题解决过程的冗余元素，可分为渐进式推理（核心解决方案的构建路径）和功能性元素（验证过程、替代解法及错误修正）。虽然渐进式推理至关重要，但功能性元素会显著增加测试时推理的计算负担。我们提出PIR（基于困惑度的重要性优化框架），该原则性框架通过量化评估每个推理步骤对答案预测置信度的影响来确定其重要性。PIR系统性地识别并选择性剪枝低重要性功能步骤，同时保留渐进式推理成分，从而生成保持核心解决路径完整性且减少冗余的优化训练数据。基于PIR优化数据微调的模型展现出更优的测试时扩展特性：在AIME、AMC和GPQA Diamond等具有挑战性的推理基准测试中，模型生成的推理链更简洁，准确率提升（+0.9%至+6.6%），同时显著降低token使用量（-3%至-41%）。该方法在不同模型规模、数据源和token预算条件下均表现出强泛化能力，为在测试时扩展效率、响应时间和计算效率受限场景中部署具备推理能力的LLMs提供了实用解决方案。

ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment

Abstract

arXiv:2505.19241v1 Announce Type: cross Abstract: The recent success of using human preferences to align large language models (LLMs) has significantly improved their performance in various downstream tasks like question answering, mathematical reasoning, and code generation. However, achieving effective LLM alignment depends on high-quality human preference datasets. Collecting these datasets requires human preference annotation, which is costly and resource-intensive, necessitating efficient active data selection methods. Existing methods either lack a strong theoretical foundation or depend on restrictive reward function assumptions (e.g., linearity). To this end, we propose an algorithm, ActiveDPO, that uses a theoretically grounded data selection criterion for non-linear reward functions while directly leveraging the LLM itself to parameterize the reward model that is used for active data selection. As a result, ActiveDPO explicitly accounts for the influence of LLM on data selection, unlike methods that select the data without considering the LLM that is being aligned, thereby leading to more effective and efficient data collection. Extensive experiments show that ActiveDPO outperforms existing methods across various models and datasets.

摘要

近期利用人类偏好对齐大型语言模型（LLMs）的成功显著提升了其在问答、数学推理和代码生成等下游任务中的表现。然而，实现有效的LLM对齐依赖于高质量的人类偏好数据集。收集这些数据需要进行人工偏好标注，成本高昂且资源密集，因此需要高效的数据主动选择方法。现有方法要么缺乏坚实的理论基础，要么依赖于严格的奖励函数假设（如线性）。为此，我们提出了一种算法ActiveDPO，该算法基于理论依据为非线性的奖励函数设计数据选择标准，并直接利用LLM本身参数化用于主动数据选择的奖励模型。与不考虑待对齐LLM影响的数据选择方法不同，ActiveDPO显式地考虑了LLM对数据选择的影响，从而实现更高效的数据收集。大量实验表明，ActiveDPO在不同模型和数据集上均优于现有方法。

Two LLMs debate, both are certain they've won

Abstract

arXiv:2505.19184v1 Announce Type: cross Abstract: Can LLMs accurately adjust their confidence when facing opposition? Building on previous studies measuring calibration on static fact-based question-answering tasks, we evaluate Large Language Models (LLMs) in a dynamic, adversarial debate setting, uniquely combining two realistic factors: (a) a multi-turn format requiring models to update beliefs as new information emerges, and (b) a zero-sum structure to control for task-related uncertainty, since mutual high-confidence claims imply systematic overconfidence. We organized 60 three-round policy debates among ten state-of-the-art LLMs, with models privately rating their confidence (0-100) in winning after each round. We observed five concerning patterns: (1) Systematic overconfidence: models began debates with average initial confidence of 72.9% vs. a rational 50% baseline. (2) Confidence escalation: rather than reducing confidence as debates progressed, debaters increased their win probabilities, averaging 83% by the final round. (3) Mutual overestimation: in 61.7% of debates, both sides simultaneously claimed >=75% probability of victory, a logical impossibility. (4) Persistent self-debate bias: models debating identical copies increased confidence from 64.1% to 75.2%; even when explicitly informed their chance of winning was exactly 50%, confidence still rose (from 50.0% to 57.1%). (5) Misaligned private reasoning: models' private scratchpad thoughts sometimes differed from their public confidence ratings, raising concerns about faithfulness of chain-of-thought reasoning. These results suggest LLMs lack the ability to accurately self-assess or update their beliefs in dynamic, multi-turn tasks; a major concern as LLM outputs are deployed without careful review in assistant roles or agentic settings.

摘要

大型语言模型能否在面对反对意见时准确调整其置信度？基于先前针对静态事实问答任务校准度的研究，我们在动态对抗性辩论场景中评估了大语言模型（LLMs），该设置独特地结合了两个现实因素：（a）需要模型根据新出现信息更新信念的多轮对话形式；（b）用于控制任务相关不确定性的零和结构——因为双方同时高置信度的主张意味着系统性过度自信。我们组织了十种前沿LLM参与的60场三轮政策辩论，模型在每轮结束后私下评估其获胜置信度（0-100）。观察到五个值得关注的现象：（1）系统性过度自信：模型初始平均置信度为72.9%，而理性基线应为50%；（2）置信度升级：随着辩论推进，辩手反而提高获胜概率，最终轮平均达83%；（3）相互高估：61.7%的辩论中出现双方同时宣称≥75%胜率的逻辑矛盾；（4）持续性自我辩论偏差：与相同副本辩论时，模型置信度从64.1%升至75.2%；即使明确告知胜率应为50%，置信度仍从50.0%上升至57.1%；（5）非对齐的私有推理：模型的私有推理过程有时与其公开置信度评级不一致，引发对思维链推理可信度的担忧。这些结果表明LLMs在动态多轮任务中缺乏准确自我评估或更新信念的能力，当LLM输出被未经审慎核查地部署于助手角色或自主场景时，将构成重大隐患。
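
The zero-sum argument here is easy to check mechanically: in a debate with one winner, the two sides' win probabilities should sum to 100, so two simultaneous claims of 75 or more cannot both be right. The sketch below is a minimal illustration of that check; the function names and the sample confidence pairs are made up for illustration and are not the paper's data.

```python
from typing import List, Tuple

Debate = Tuple[float, float]  # (side A confidence, side B confidence), each 0-100


def mutual_overconfidence_rate(debates: List[Debate], threshold: float = 75.0) -> float:
    """Fraction of debates in which both sides report confidence >= threshold."""
    if not debates:
        return 0.0
    flagged = sum(1 for a, b in debates if a >= threshold and b >= threshold)
    return flagged / len(debates)


def mean_excess_confidence(debates: List[Debate]) -> float:
    """Average amount by which the two confidences exceed the coherent total of 100."""
    return sum(a + b - 100.0 for a, b in debates) / len(debates)


# Hypothetical confidence pairs, for illustration only.
sample = [(80.0, 85.0), (60.0, 40.0), (75.0, 90.0), (70.0, 55.0)]

if __name__ == "__main__":
    print(mutual_overconfidence_rate(sample))
    print(mean_excess_confidence(sample))
```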

Abstract

arXiv:2505.19212v1 Announce Type: cross Abstract: Recent advances in large language models (LLMs) have enabled their use in complex agentic roles, involving decision-making with humans or other agents, making ethical alignment a key AI safety concern. While prior work has examined both LLMs' moral judgment and strategic behavior in social dilemmas, there is limited understanding of how they act when moral imperatives directly conflict with rewards or incentives. To investigate this, we introduce Moral Behavior in Social Dilemma Simulation (MoralSim) and evaluate how LLMs behave in the prisoner's dilemma and public goods game with morally charged contexts. In MoralSim, we test a range of frontier models across both game structures and three distinct moral framings, enabling a systematic examination of how LLMs navigate social dilemmas in which ethical norms conflict with payoff-maximizing strategies. Our results show substantial variation across models in both their general tendency to act morally and the consistency of their behavior across game types, the specific moral framing, and situational factors such as opponent behavior and survival risks. Crucially, no model exhibits consistently moral behavior in MoralSim, highlighting the need for caution when deploying LLMs in agentic roles where the agent's "self-interest" may conflict with ethical expectations. Our code is available at https://github.com/sbackmann/moralsim.

摘要

大语言模型（LLMs）的最新进展使其能够承担复杂的代理角色，涉及与人类或其他代理的决策过程，这使得伦理对齐成为人工智能安全的关键问题。尽管先前研究已考察过LLMs在社会困境中的道德判断和策略行为，但对其在道德要求与利益激励直接冲突时的行为机制仍缺乏深入理解。为此，我们开发了'社会困境模拟中的道德行为'（MoralSim）系统，通过囚徒困境和公共物品博弈的道德情境设置来评估LLMs的行为模式。在MoralSim中，我们测试了多种前沿模型在两种博弈结构和三种不同道德框架下的表现，从而系统性地研究LLMs如何在伦理规范与收益最大化策略相冲突的社会困境中做出抉择。研究结果显示，不同模型在道德行为总体倾向性方面存在显著差异，其行为一致性也随博弈类型、特定道德框架以及对手行为、生存风险等情境因素而变化。关键发现是：所有模型在MoralSim中均未表现出持续稳定的道德行为，这警示我们在LLMs可能面临'自身利益'与伦理期望冲突的代理角色部署中需保持审慎。代码开源地址：https://github.com/sbackmann/moralsim。

LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models

Abstract

arXiv:2505.19240v1 Announce Type: cross Abstract: Large language model (LLM) research has grown rapidly, along with increasing concern about their limitations such as failures in reasoning, hallucinations, and limited multilingual capability. In this survey, we conduct a data-driven, semi-automated review of research on limitations of LLM (LLLMs) from 2022 to 2024 using a bottom-up approach. From a corpus of 250,000 ACL and arXiv papers, we identify 14,648 relevant papers using keyword filtering, LLM-based classification, validated against expert labels, and topic clustering (via two approaches, HDBSCAN+BERTopic and LlooM). We find that LLM-related research increases over fivefold in ACL and fourfold in arXiv. Since 2022, LLLMs research grows even faster, reaching over 30% of LLM papers by late 2024. Reasoning remains the most studied limitation, followed by generalization, hallucination, bias, and security. The distribution of topics in the ACL dataset stays relatively stable over time, while arXiv shifts toward safety and controllability (with topics like security risks, alignment, hallucinations, knowledge editing), and multimodality between 2022 and 2024. We release a dataset of annotated abstracts and a validated methodology, and offer a quantitative view of trends in LLM limitations research.

摘要

随着大语言模型（LLM）研究的快速发展，人们对其局限性（如推理失败、幻觉问题及多语言能力不足）的关注也日益增加。本综述采用自下而上的方法，对2022至2024年间关于LLM局限性（LLLMs）的研究进行了数据驱动的半自动化回顾。通过从25万篇ACL和arXiv论文中筛选，我们结合关键词过滤、基于LLM的分类（经专家标注验证）以及主题聚类（采用HDBSCAN+BERTopic与LlooM两种方法），最终确定了14,648篇相关文献。研究发现：ACL中LLM相关研究增长超五倍，arXiv中增长四倍；自2022年起，LLLMs研究增速更快，至2024年末已占LLM论文的30%以上。推理仍是研究最多的局限领域，其次为泛化性、幻觉、偏见和安全性。ACL数据集的主题分布相对稳定，而arXiv在2022至2024年间转向安全可控性（如安全风险、对齐、幻觉、知识编辑等主题）与多模态研究。我们公开了标注摘要数据集及验证方法，为LLM局限性研究趋势提供了量化视角。

MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search

Abstract

arXiv:2505.19209v1 Announce Type: cross Abstract: Large language models (LLMs) have shown promise in automating scientific hypothesis generation, yet existing approaches primarily yield coarse-grained hypotheses lacking critical methodological and experimental details. We introduce and formally define the novel task of fine-grained scientific hypothesis discovery, which entails generating detailed, experimentally actionable hypotheses from coarse initial research directions. We frame this as a combinatorial optimization problem and investigate the upper limits of LLMs' capacity to solve it when maximally leveraged. Specifically, we explore four foundational questions: (1) how to best harness an LLM's internal heuristics to formulate the fine-grained hypothesis it itself would judge as the most promising among all the possible hypotheses it might generate, based on its own internal scoring -- thus defining a latent reward landscape over the hypothesis space; (2) whether such LLM-judged better hypotheses exhibit stronger alignment with ground-truth hypotheses; (3) whether shaping the reward landscape using an ensemble of diverse LLMs of similar capacity yields better outcomes than defining it with repeated instances of the strongest LLM among them; and (4) whether an ensemble of identical LLMs provides a more reliable reward landscape than a single LLM. To address these questions, we propose a hierarchical search method that incrementally proposes and integrates details into the hypothesis, progressing from general concepts to specific experimental configurations. We show that this hierarchical process smooths the reward landscape and enables more effective optimization. Empirical evaluations on a new benchmark of expert-annotated fine-grained hypotheses from recent chemistry literature show that our method consistently outperforms strong baselines.

摘要

大语言模型（LLMs）在自动化科学假设生成方面展现出潜力，但现有方法主要产生粗粒度的假设，缺乏关键的方法论和实验细节。我们提出并正式定义了细粒度科学假设发现这一新任务，其目标是从初始的粗粒度研究方向生成详细且可实验操作的假设。我们将此任务构建为一个组合优化问题，并探究在最大限度利用LLMs时其解决该问题的能力上限。具体而言，我们探讨了四个基础问题：（1）如何最佳利用LLM的内部启发式方法，使其生成自身基于内部评分认为最有潜力的细粒度假设——从而在假设空间上定义一个潜在的奖励景观；（2）此类由LLM判定为更优的假设是否与真实假设表现出更强的一致性；（3）使用一组能力相近的多样化LLM塑造奖励景观，是否比使用其中最强LLM的重复实例定义奖励景观能产生更好的结果；（4）一组相同的LLM是否比单个LLM提供更可靠的奖励景观。为解决这些问题，我们提出了一种分层搜索方法，该方法从一般概念逐步推进到具体实验配置，逐步提出并将细节整合到假设中。我们证明这一分层过程能够平滑奖励景观并实现更有效的优化。在基于近期化学文献中专家标注的细粒度假设新基准上的实证评估表明，我们的方法始终优于强基线模型。

VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use

Abstract

arXiv:2505.19255v1 Announce Type: cross Abstract: Reinforcement Learning Finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, self-correction, and effective tool use. While recent works attempt to extend RFT to vision-language models (VLMs), these efforts largely produce text-only reasoning conditioned on static image inputs, falling short of true multimodal reasoning in the response. In contrast, test-time methods like Visual Sketchpad incorporate visual steps but lack training mechanisms. We introduce VTool-R1, the first framework that trains VLMs to generate multimodal chains of thought by interleaving text and intermediate visual reasoning steps. VTool-R1 integrates Python-based visual editing tools into the RFT process, enabling VLMs to learn when and how to generate visual reasoning steps that benefit final reasoning. Trained with outcome-based rewards tied to task accuracy, our approach elicits strategic visual tool use for reasoning without relying on process-based supervision. Experiments on structured visual question answering over charts and tables show that VTool-R1 enhances reasoning performance by teaching VLMs to "think with images" and generate multimodal chain of thoughts with tools.

摘要

强化学习微调（RFT）通过实现长链思维、自我修正和有效工具使用，显著提升了大型语言模型（LLMs）的推理能力。尽管近期研究尝试将RFT扩展至视觉语言模型（VLMs），但这些工作主要生成基于静态图像输入的纯文本推理，未能实现响应中真正的多模态推理。相比之下，像Visual Sketchpad这样的测试时方法虽然包含视觉步骤，但缺乏训练机制。我们提出VTool-R1——首个通过交错文本与中间视觉推理步骤来训练VLMs生成多模态思维链的框架。VTool-R1将基于Python的视觉编辑工具集成到RFT流程中，使VLMs能学习何时及如何生成有益于最终推理的视觉推理步骤。通过绑定任务准确度的结果导向奖励进行训练，我们的方法无需依赖过程监督即可激发策略性视觉工具使用以支持推理。在图表结构化视觉问答任务上的实验表明，VTool-R1通过教导VLMs"用图像思考"并生成基于工具的多模态思维链，显著提升了推理性能。

Towards Large Reasoning Models for Agriculture

Abstract

arXiv:2505.19259v1 Announce Type: cross Abstract: Agricultural decision-making involves complex, context-specific reasoning, where choices about crops, practices, and interventions depend heavily on geographic, climatic, and economic conditions. Traditional large language models (LLMs) often fall short in navigating this nuanced problem due to limited reasoning capacity. We hypothesize that recent advances in large reasoning models (LRMs) can better handle such structured, domain-specific inference. To investigate this, we introduce AgReason, the first expert-curated open-ended science benchmark with 100 questions for agricultural reasoning. Evaluations across thirteen open-source and proprietary models reveal that LRMs outperform conventional ones, though notable challenges persist, with the strongest Gemini-based baseline achieving 36% accuracy. We also present AgThoughts, a large-scale dataset of 44.6K question-answer pairs generated with human oversight and equipped with synthetically generated reasoning traces. Using AgThoughts, we develop AgThinker, a suite of small reasoning models that can be run on consumer-grade GPUs, and show that our dataset can be effective in unlocking agricultural reasoning abilities in LLMs. Our project page is here: https://baskargroup.github.io/Ag_reasoning/

摘要

农业决策涉及复杂且情境特定的推理过程，作物选择、实践措施及干预方案的制定高度依赖于地理、气候和经济条件。传统大语言模型（LLMs）由于推理能力有限，往往难以应对这种具有细微差异的问题。我们假设近期发展的大规模推理模型（LRMs）能更好地处理此类结构化、领域特定的推断任务。为验证该假设，我们推出AgReason——首个由专家构建的开放式科学基准测试，包含100道农业推理问题。通过对13个开源和专有模型的评估发现，尽管仍存在显著挑战，LRMs表现优于传统模型，其中基于Gemini的最强基线模型准确率达到36%。我们还提出AgThoughts数据集，该大规模数据集包含44.6K个经人工监督生成的问答对，并配备合成生成的推理轨迹。利用AgThoughts，我们开发了可在消费级GPU上运行的轻量级推理模型套件AgThinker，证明该数据集能有效激发LLMs的农业推理能力。项目页面详见：https://baskargroup.github.io/Ag_reasoning/

Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning

Abstract

arXiv:2505.19261v1 Announce Type: cross Abstract: Current text-to-image diffusion generation typically employs complete-text conditioning. Due to the intricate syntax, diffusion transformers (DiTs) inherently suffer from a comprehension defect of complete-text captions. One-fly complete-text input either overlooks critical semantic details or causes semantic confusion by simultaneously modeling diverse semantic primitive types. To mitigate this defect of DiTs, we propose a novel split-text conditioning framework named DiT-ST. This framework converts a complete-text caption into a split-text caption, a collection of simplified sentences, to explicitly express various semantic primitives and their interconnections. The split-text caption is then injected into different denoising stages of DiT-ST in a hierarchical and incremental manner. Specifically, DiT-ST leverages Large Language Models to parse captions, extracting diverse primitives and hierarchically sorting out and constructing these primitives into a split-text input. Moreover, we partition the diffusion denoising process according to its differential sensitivities to diverse semantic primitive types and determine the appropriate timesteps to incrementally inject tokens of diverse semantic primitive types into input tokens via cross-attention. In this way, DiT-ST enhances the representation learning of specific semantic primitive types across different stages. Extensive experiments validate the effectiveness of our proposed DiT-ST in mitigating the complete-text comprehension defect.

摘要

当前文本到图像的扩散生成通常采用完整文本条件输入。由于复杂语法结构，扩散变换器（DiTs）固有地存在对完整文本描述的理解缺陷：一次性完整文本输入要么忽略关键语义细节，要么因同时建模多种语义基元类型而导致语义混淆。为缓解DiTs的这一缺陷，我们提出名为DiT-ST的新型拆分文本条件框架。该框架将完整文本描述转换为由简化句子组成的拆分文本描述，以显式表达各类语义基元及其相互关系。随后通过分层渐进方式，将拆分文本描述注入DiT-ST的不同去噪阶段。具体而言，DiT-ST利用大语言模型解析描述文本，提取多样化语义基元，并分层梳理构建为拆分文本输入。此外，我们根据扩散去噪过程对不同语义基元类型的差异敏感性进行阶段划分，确定合适时间步，通过交叉注意力机制将各类语义基元的标记逐步注入输入标记。这种方法增强了DiT-ST在不同阶段对特定语义基元类型的表征学习能力。大量实验验证了所提DiT-ST在缓解完整文本理解缺陷方面的有效性。

100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

Abstract

arXiv:2505.19293v1 Announce Type: cross Abstract: Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to effortlessly process many originally exhausting tasks -- e.g., digesting a long-form document to find answers vs. directly asking an LLM about it. However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability, making cross-model comparison unclear. Second, such benchmarks are usually constructed with fixed input lengths, which limits their applicability across different models and fails to reveal when a model begins to break down. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities. Experiments demonstrate the superiority of our approach in effectively evaluating LLMs.

摘要

长上下文能力被视为大语言模型（LLM）最重要的能力之一，因为真正具备长上下文处理能力的LLM能让用户轻松完成许多原本繁琐的任务——例如通过消化长文本文档来寻找答案，而非直接向LLM提问。然而，现有基于真实任务的长上下文评估基准存在两大缺陷。首先，像LongBench这样的基准通常无法提供合适的指标来区分长上下文性能与模型的基线能力，导致跨模型比较不够清晰。其次，这类基准通常以固定输入长度构建，限制了其在不同模型间的适用性，且无法揭示模型何时开始失效。为解决这些问题，我们提出了一个长度可调的长上下文基准和一个新颖的评估指标，该指标能将基线知识与真实的长上下文能力分离。实验证明我们的方法在有效评估LLM方面具有优越性。

A Necessary Step toward Faithfulness: Measuring and Improving Consistency in Free-Text Explanations

Abstract

arXiv:2505.19299v1 Announce Type: cross Abstract: Faithful free-text explanations are important to ensure transparency in high-stakes AI decision-making contexts, but they are challenging to generate by language models and assess by humans. In this paper, we present a measure for Prediction-EXplanation (PEX) consistency, by extending the concept of weight of evidence. This measure quantifies how much a free-text explanation supports or opposes a prediction, serving as an important aspect of explanation faithfulness. Our analysis reveals that more than 62% of explanations generated by large language models lack this consistency. We show that applying direct preference optimization improves the consistency of generated explanations across three model families, with improvement ranging from 43.1% to 292.3%. Furthermore, we demonstrate that optimizing this consistency measure can improve explanation faithfulness by up to 9.7%.

摘要

可靠的自由文本解释对于确保高风险人工智能决策场景的透明度至关重要，但语言模型生成这类解释以及人工评估均存在挑战。本文通过扩展证据权重概念，提出了一种预测-解释（PEX）一致性度量方法。该指标量化了自由文本解释对预测的支持或反对程度，是解释可信度的重要维度。分析表明，超过62%由大语言模型生成的解释缺乏这种一致性。研究发现，采用直接偏好优化方法可提升三个模型族生成解释的一致性，改进幅度介于43.1%至292.3%之间。此外，优化该一致性指标能使解释可信度最高提升9.7%。

Retrieval-Augmented Generation for Service Discovery: Chunking Strategies and Benchmarking

Abstract

arXiv:2505.19310v1 Announce Type: cross Abstract: Integrating multiple (sub-)systems is essential to create advanced Information Systems. Difficulties mainly arise when integrating dynamic environments, e.g., the integration at design time of not yet existing services. This has been traditionally addressed using a registry that provides the API documentation of the endpoints. Large Language Models have shown to be capable of automatically creating system integrations (e.g., as service composition) based on this documentation but require concise input due to input token limitations, especially regarding comprehensive API descriptions. Currently, it is unknown how best to preprocess these API descriptions. In the present work, we (i) analyze the usage of Retrieval Augmented Generation for endpoint discovery and the chunking, i.e., preprocessing, of state-of-practice OpenAPIs to reduce the input token length while preserving the most relevant information. To further reduce the input token length for the composition prompt and improve endpoint retrieval, we propose (ii) a Discovery Agent that only receives a summary of the most relevant endpoints and retrieves specification details on demand. We evaluate RAG for endpoint discovery using (iii) a proposed novel service discovery benchmark SOCBench-D representing a general setting across numerous domains and the real-world RestBench benchmark, first, for the different chunking possibilities and parameters measuring the endpoint retrieval accuracy. Then, we assess the Discovery Agent using the same test data set. The prototype shows how to successfully employ RAG for endpoint discovery to reduce the token count. Our experiments show that endpoint-based approaches outperform naive chunking methods for preprocessing. Relying on an agent significantly improves precision while being prone to decrease recall, disclosing the need for further reasoning capabilities.

摘要

整合多个（子）系统对于构建高级信息系统至关重要。当涉及动态环境集成时（例如在设计阶段集成尚未存在的服务），主要困难随之产生。传统解决方案依赖于提供终端API文档的注册中心。大型语言模型已展现出基于此类文档自动实现系统集成（如服务组合）的能力，但由于输入令牌限制（尤其针对综合性API描述），需要精简的输入内容。目前，关于如何最优预处理这些API描述尚未形成共识。本研究（i）分析了检索增强生成技术在终端发现中的应用，以及对现行OpenAPI进行分块预处理的方法，旨在缩减输入令牌长度的同时保留最关键信息。为进一步减少组合提示的输入令牌长度并提升终端检索效率，我们提出（ii）发现代理机制——该代理仅接收最相关终端的摘要，并按需获取详细规范说明。我们通过（iii）新开发的多领域通用服务发现基准SOCBench-D和真实场景RestBench基准，首先针对不同分块方案及参数评估终端检索准确率，继而使用相同测试数据集验证发现代理性能。原型系统证明了检索增强生成技术能有效降低终端发现的令牌消耗。实验表明：基于终端的预处理方法优于简单分块策略，而代理机制在显著提升精度的同时可能降低召回率，这揭示了对进一步推理能力的需求。
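
One plausible reading of "endpoint-based" chunking is to emit one retrieval chunk per operation in the OpenAPI `paths` object instead of cutting the document at fixed character offsets. The sketch below illustrates that reading only. The helper names, the toy specification, and the word-overlap scorer (a stand-in for embedding similarity) are assumptions made for illustration, not the paper's implementation.

```python
from typing import Dict, List

# Operation keys defined for an OpenAPI Path Item Object.
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def chunk_by_endpoint(spec: Dict) -> List[Dict[str, str]]:
    """Return one chunk per (method, path) operation found in spec["paths"]."""
    chunks = []
    for path, item in spec.get("paths", {}).items():
        for method, operation in item.items():
            if method.lower() not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters"
            endpoint_id = f"{method.upper()} {path}"
            parts = (endpoint_id, operation.get("summary", ""), operation.get("description", ""))
            chunks.append({"id": endpoint_id, "text": " ".join(p for p in parts if p)})
    return chunks


def retrieve(query: str, chunks: List[Dict[str, str]], top_k: int = 1) -> List[Dict[str, str]]:
    """Rank chunks by word overlap with the query (toy stand-in for vector search)."""
    query_words = set(query.lower().split())

    def overlap(chunk: Dict[str, str]) -> int:
        return len(query_words & set(chunk["text"].lower().split()))

    return sorted(chunks, key=overlap, reverse=True)[:top_k]


# Hypothetical miniature specification, for illustration only.
toy_spec = {
    "paths": {
        "/pets": {
            "parameters": [],
            "get": {"summary": "List all pets"},
            "post": {"summary": "Create a pet"},
        },
        "/pets/{petId}": {"get": {"summary": "Fetch one pet by id"}},
    }
}

if __name__ == "__main__":
    endpoint_chunks = chunk_by_endpoint(toy_spec)
    print([c["id"] for c in endpoint_chunks])
    print(retrieve("create a pet", endpoint_chunks))
```

Because each chunk maps to exactly one endpoint, a retrieved chunk identifies an endpoint directly, and its full specification can be looked up on demand afterwards.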

Communication-Efficient Multi-Device Inference Acceleration for Transformer Models

Abstract

arXiv:2505.19342v1 Announce Type: cross Abstract: Transformer models power many AI applications but suffer from high inference latency, limiting their use in real-time settings. Multi-device inference can reduce latency by parallelizing computation. Yet, existing methods require high inter-device bandwidth, making them impractical for bandwidth-constrained environments. We propose ASTRA, a communication-efficient framework that accelerates Transformer inference through a novel integration of sequence parallelism and a Mixed-Precision Attention mechanism designed to minimize inter-device communication. ASTRA compresses non-local token embeddings via vector quantization and preserves task accuracy through two optimizations, Noise-Augmented Quantization and Distributed Class Tokens. Experiments on ViT and GPT2 across vision and NLP tasks show that ASTRA achieves up to 2.64X speedups over single-device inference and up to 15.25X speedups over state-of-the-art multi-device inferences, while operating under bandwidths as low as 10 Mbps. ASTRA is open-sourced at https://github.com/xl1990/Astra.

摘要

Transformer模型虽驱动众多AI应用，但其高推理延迟限制了实时场景下的使用。多设备推理可通过并行计算降低延迟，然而现有方法需要高设备间带宽，在带宽受限环境中难以实用。我们提出ASTRA框架，该通信高效方案通过序列并行与混合精度注意力机制创新性结合来加速Transformer推理，旨在最小化设备间通信。ASTRA采用向量量化压缩非局部令牌嵌入，并通过噪声增强量化和分布式类别令牌两项优化保持任务精度。在视觉与NLP任务中对ViT和GPT2的实验表明，ASTRA相比单设备推理最高实现2.64倍加速，较先进多设备推理方案最高达15.25倍加速，且可在低至10 Mbps带宽下运行。ASTRA已开源：https://github.com/xl1990/Astra。
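
The bandwidth saving from vector quantization comes from sending a small codebook index in place of a full embedding vector. The sketch below shows plain nearest-codeword quantization in pure Python. It is a generic illustration with a hand-written toy codebook, and it does not reproduce ASTRA's Mixed-Precision Attention, Noise-Augmented Quantization, or Distributed Class Tokens.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: List[Vector]) -> int:
    """Index of the codebook entry closest to vec in squared Euclidean distance."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def encode(embeddings: List[Vector], codebook: List[Vector]) -> List[int]:
    """Sender side: replace each embedding by the index of its nearest codeword."""
    return [nearest_code(v, codebook) for v in embeddings]


def decode(indices: List[int], codebook: List[Vector]) -> List[List[float]]:
    """Receiver side: look the codewords back up (lossy reconstruction)."""
    return [list(codebook[i]) for i in indices]


# Hypothetical 2-D codebook and embeddings; in practice a codebook is learned,
# for example with k-means.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
toy_embeddings = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7]]

if __name__ == "__main__":
    sent = encode(toy_embeddings, toy_codebook)
    print(sent)
    print(decode(sent, toy_codebook))
```

With a codebook of K entries, each token costs about log2(K) bits on the wire instead of d floating-point values, at the price of a reconstruction error on the receiving device.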

Simple and Effective Baselines for Code Summarisation Evaluation

Abstract

arXiv:2505.19392v1 Announce Type: cross Abstract: Code documentation is useful, but writing it is time-consuming. Different techniques for generating code summaries have emerged, but comparing them is difficult because human evaluation is expensive and automatic metrics are unreliable. In this paper, we introduce a simple new baseline in which we ask an LLM to give an overall score to a summary. Unlike n-gram and embedding-based baselines, our approach is able to consider the code when giving a score. This allows us to also make a variant that does not consider the reference summary at all, which could be used for other tasks, e.g., to evaluate the quality of documentation in code bases. We find that our method is as good as or better than prior metrics, though we recommend using it in conjunction with embedding-based methods to avoid the risk of LLM-specific bias.

摘要

代码文档具有实用价值，但编写过程耗时。尽管已出现多种生成代码摘要的技术，但由于人工评估成本高昂且自动度量指标不可靠，比较这些技术存在困难。本文提出一种简单的新基线方法：通过大型语言模型（LLM）对摘要进行整体评分。与基于n-gram和嵌入的基线方法不同，我们的方法能在评分时考虑代码本身。这使得我们可以构建完全不参考原始摘要的变体，该方法还可应用于其他任务（例如评估代码库中文档的质量）。研究发现，尽管建议与基于嵌入的方法结合使用以避免LLM特定偏差的风险，但本方法优于或等同于现有度量指标。

"It's Not Just Labeling" -- A Research on LLM Generated Feedback Interpretability and Image Labeling Sketch Features

Abstract

arXiv:2505.19419v1 Announce Type: cross Abstract: The quality of training data is critical to the performance of machine learning applications in domains like transportation, healthcare, and robotics. Accurate image labeling, however, often relies on time-consuming, expert-driven methods with limited feedback. This research introduces a sketch-based annotation approach supported by large language models (LLMs) to reduce technical barriers and enhance accessibility. Using a synthetic dataset, we examine how sketch recognition features relate to LLM feedback metrics, aiming to improve the reliability and interpretability of LLM-assisted labeling. We also explore how prompting strategies and sketch variations influence feedback quality. Our main contribution is a sketch-based virtual assistant that simplifies annotation for non-experts and advances LLM-driven labeling tools in terms of scalability, accessibility, and explainability.

摘要

训练数据的质量对机器学习在交通、医疗和机器人等领域的应用性能至关重要。然而，准确的图像标注通常依赖于耗时且反馈有限的专家驱动方法。本研究提出了一种基于草图标注的方法，并借助大语言模型（LLMs）降低技术门槛、提升可及性。通过使用合成数据集，我们探究了草图识别特征与大语言模型反馈指标之间的关联，旨在提高LLM辅助标注的可靠性和可解释性。同时，我们还研究了提示策略和草图变体对反馈质量的影响。我们的主要贡献是开发了一个基于草图的虚拟助手，该工具不仅简化了非专业人士的标注流程，还在可扩展性、可访问性和可解释性方面推动了LLM驱动的标注工具发展。

Alignment of large language models with constrained learning

Abstract

arXiv:2505.19387v1 Announce Type: cross Abstract: We study the problem of computing an optimal large language model (LLM) policy for a constrained alignment problem, where the goal is to maximize a primary reward objective while satisfying constraints on secondary utilities. Despite the popularity of Lagrangian-based LLM policy search in constrained alignment, iterative primal-dual methods often fail to converge, and non-iterative dual-based methods do not achieve optimality in the LLM parameter space. To address these challenges, we employ Lagrangian duality to develop an iterative dual-based alignment method that alternates between updating the LLM policy via Lagrangian maximization and updating the dual variable via dual descent. In theory, we characterize the primal-dual gap between the primal value in the distribution space and the dual value in the LLM parameter space. We further quantify the optimality gap of the learned LLM policies at near-optimal dual variables with respect to both the objective and the constraint functions. These results prove that dual-based alignment methods can find an optimal constrained LLM policy, up to an LLM parametrization gap. We demonstrate the effectiveness and merits of our approach through extensive experiments conducted on the PKU-SafeRLHF dataset.

摘要

我们研究如何为受限对齐问题计算最优大语言模型（LLM）策略，其目标是在满足次要效用约束条件下最大化主要奖励目标。尽管基于拉格朗日方法的LLM策略搜索在受限对齐中广受欢迎，但迭代的原始对偶方法常无法收敛，而非迭代的对偶方法在LLM参数空间中无法达到最优性。为解决这些挑战，我们运用拉格朗日对偶性开发了一种迭代对偶对齐方法，该方法通过在拉格朗日最大化更新LLM策略与对偶下降更新对偶变量之间交替进行。理论上，我们刻画了分布空间中的原始值与LLM参数空间中对偶值之间的原始对偶间隙。我们进一步量化了在近优对偶变量下所学LLM策略关于目标函数和约束函数的最优性间隙。这些结果证明对偶对齐方法能找到最优受限LLM策略（直至LLM参数化间隙）。通过在PKU-SafeRLHF数据集上的大量实验，我们验证了所提方法的有效性和优势。
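The alternation described above (maximize the Lagrangian over the policy, then take a descent step on the dual variable) can be illustrated on a toy problem. The sketch below is not the paper's method: it assumes a finite set of three candidate responses, a KL-regularized objective with a uniform reference policy, and a closed-form policy update in distribution space rather than in LLM parameter space. All names (`best_response`, `dual_descent`, `threshold`, `beta`, `step`) and all numbers are hypothetical.

```python
import math

def best_response(rewards, utilities, ref_probs, lam, beta):
    """Policy maximizing E[r] + lam * E[g] - beta * KL(pi || ref) over a finite set.

    Assumption: with a KL regularizer the maximizer has the closed form
    pi(y) proportional to ref(y) * exp((r(y) + lam * g(y)) / beta).
    """
    weights = [p * math.exp((r + lam * g) / beta)
               for p, r, g in zip(ref_probs, rewards, utilities)]
    total = sum(weights)
    return [w / total for w in weights]

def dual_descent(rewards, utilities, ref_probs, threshold,
                 beta=0.5, step=0.5, iters=200):
    """Alternate policy maximization and projected dual descent.

    The constraint is E[g] >= threshold. The dual gradient at lam is
    E_pi[g] - threshold, so lam grows while the constraint is violated.
    """
    lam = 0.0
    for _ in range(iters):
        policy = best_response(rewards, utilities, ref_probs, lam, beta)
        expected_utility = sum(p * g for p, g in zip(policy, utilities))
        lam = max(0.0, lam - step * (expected_utility - threshold))
    # Recompute the policy so it matches the final dual variable.
    policy = best_response(rewards, utilities, ref_probs, lam, beta)
    return policy, lam

if __name__ == "__main__":
    # Toy data: response 0 has high reward but zero utility, response 2 the opposite.
    policy, lam = dual_descent(
        rewards=[1.0, 0.5, 0.0],
        utilities=[0.0, 0.5, 1.0],
        ref_probs=[1 / 3, 1 / 3, 1 / 3],
        threshold=0.6,
    )
    print("dual variable:", round(lam, 3))
    print("policy:", [round(p, 3) for p in policy])
```

In this toy setting the unconstrained policy violates the utility constraint, so the dual variable rises until the expected utility meets the threshold. In the LLM setting the policy update would be an approximate fine-tuning step, which is where the parametrization gap mentioned in the abstract enters.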

PatentScore: Multi-dimensional Evaluation of LLM-Generated Patent Claims

Abstract

arXiv:2505.19345v1 Announce Type: cross Abstract: Natural language generation (NLG) metrics play a central role in evaluating generated texts, but are not well suited for the structural and legal characteristics of patent documents. Large language models (LLMs) offer strong potential in automating patent generation, yet research on evaluating LLM-generated patents remains limited, especially in evaluating the generation quality of patent claims, which are central to defining the scope of protection. Effective claim evaluation requires addressing legal validity, technical accuracy, and structural compliance. To address this gap, we introduce PatentScore, a multi-dimensional evaluation framework for assessing LLM-generated patent claims. PatentScore incorporates: (1) hierarchical decomposition for claim analysis; (2) domain-specific validation patterns based on legal and technical standards; and (3) scoring across structural, semantic, and legal dimensions. Unlike general-purpose NLG metrics, PatentScore reflects patent-specific constraints and document structures, enabling evaluation beyond surface similarity. We evaluate 400 GPT-4o-mini generated Claim 1s and report a Pearson correlation of $r = 0.819$ with expert annotations, outperforming existing NLG metrics. Furthermore, we conduct additional evaluations using open models such as Claude-3.5-Haiku and Gemini-1.5-flash, all of which show strong correlations with expert judgments, confirming the robustness and generalizability of our framework.

摘要

自然语言生成（NLG）指标在评估生成文本方面发挥着核心作用，但并不适合专利文件的结构和法律特征。大语言模型（LLM）在自动化专利生成方面展现出强大潜力，然而针对LLM生成专利的评估研究仍然有限，特别是在评估权利要求书的生成质量方面——这是界定保护范围的核心要素。有效的权利要求评估需要解决法律有效性、技术准确性和结构合规性等问题。为填补这一空白，我们提出了PatentScore，一个用于评估LLM生成专利权利要求的多维框架。PatentScore包含：（1）权利要求分析的层次化分解；（2）基于法律和技术标准的领域特定验证模式；（3）结构、语义和法律维度的评分体系。与通用NLG指标不同，PatentScore反映了专利特有的约束条件和文档结构，能够实现超越表面相似性的深度评估。我们评估了400份GPT-4o-mini生成的权利要求1，报告其与专家标注的Pearson相关性达 $r=0.819$ ，优于现有NLG指标。此外，我们还使用Claude-3.5-Haiku和Gemini-1.5-flash等开源模型进行了补充评估，所有结果均显示与专家判断具有强相关性，证实了我们框架的稳健性和普适性。
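For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how a Pearson correlation between metric scores and expert annotations is computed. The function name `pearson_r` and the sample scores are hypothetical; they are not data from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

if __name__ == "__main__":
    # Hypothetical metric scores and expert ratings for five claims.
    metric_scores = [0.2, 0.4, 0.5, 0.7, 0.9]
    expert_scores = [1.0, 2.0, 2.5, 4.0, 4.5]
    print(round(pearson_r(metric_scores, expert_scores), 3))
```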

The Role of Diversity in In-Context Learning for Large Language Models

Abstract

arXiv:2505.19426v1 Announce Type: cross Abstract: In-context learning (ICL) is a crucial capability of current large language models (LLMs), where the selection of examples plays a key role in performance. While most existing approaches focus on selecting the most similar examples to the query, the impact of diversity in example selection remains underexplored. We systematically investigate the role of diversity in in-context example selection through experiments across a range of tasks, from sentiment classification to more challenging math and code problems. Experiments on Llama-3.1, Gemma-2, and Mistral-v0.3 families of models show that diversity-aware selection methods improve performance, particularly on complex tasks like math and code, and enhance robustness to out-of-distribution queries. To support these findings, we introduce a theoretical framework that explains the benefits of incorporating diversity in in-context example selection.

摘要

上下文学习（ICL）是当前大语言模型（LLMs）的核心能力，其中示例的选择对性能至关重要。虽然现有方法大多侧重于选择与查询最相似的示例，但示例选择多样性的影响仍未得到充分探索。我们通过一系列实验（从情感分类到更具挑战性的数学和代码问题）系统研究了多样性在上下文示例选择中的作用。基于Llama-3.1、Gemma-2和Mistral-v0.3系列模型的实验表明，考虑多样性的选择方法能提升性能（尤其在数学和代码等复杂任务上），并增强对分布外查询的鲁棒性。为支持这些发现，我们提出了一个理论框架，用以解释在上下文示例选择中引入多样性的优势。
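The abstract does not say which diversity-aware selection methods were tested. As one plausible illustration, the sketch below uses maximal marginal relevance (MMR), a standard technique swapped in here as an assumption: each step picks the candidate that is most similar to the query while being least similar to the examples already chosen. The function names, the `trade_off` parameter, and the toy embeddings are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_examples(query, candidates, k, trade_off=0.5):
    """Greedy MMR selection; returns indices into `candidates`.

    trade_off = 1.0 reduces to pure similarity ranking; lower values
    penalize candidates that resemble already-selected examples.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return trade_off * relevance - (1.0 - trade_off) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

if __name__ == "__main__":
    query = [1.0, 0.0]
    # Candidates 0 and 1 are near-duplicates; candidate 2 points elsewhere.
    candidates = [[1.0, 0.1], [1.0, 0.11], [0.6, 0.8]]
    print(select_examples(query, candidates, k=2, trade_off=1.0))
    print(select_examples(query, candidates, k=2, trade_off=0.3))
```

With `trade_off=1.0` the two near-duplicate candidates are chosen; with a lower value the second pick shifts to the dissimilar candidate, which is the behavior a diversity-aware selector is meant to produce.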

Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation

Abstract

arXiv:2505.19430v1 Announce Type: cross Abstract: Counterfactual reasoning typically involves considering alternatives to actual events. While often applied to understand past events, a distinct form, forward counterfactual reasoning, focuses on anticipating plausible future developments. This type of reasoning is invaluable in dynamic financial markets, where anticipating market developments can powerfully unveil potential risks and opportunities for stakeholders, guiding their decision-making. However, performing this at scale is challenging due to the cognitive demands involved, underscoring the need for automated solutions. Large Language Models (LLMs) offer promise, but remain unexplored for this application. To address this gap, we introduce a novel benchmark, Fin-Force (FINancial FORward Counterfactual Evaluation). By curating financial news headlines and providing structured evaluation, Fin-Force supports LLM-based forward counterfactual generation. This paves the way for scalable and automated solutions for exploring and anticipating future market developments, thereby providing structured insights for decision-making. Through experiments on Fin-Force, we evaluate state-of-the-art LLMs and counterfactual generation methods, analyzing their limitations and proposing insights for future research.

摘要

反事实推理通常涉及对实际事件的替代性考量。尽管该方法常用于理解过去事件，但一种独特的形式——前瞻性反事实推理——专注于预测未来可能的发展态势。这种推理方式在动态变化的金融市场中具有重要价值，通过预判市场走势能有效揭示利益相关者面临的潜在风险与机遇，从而指导决策。然而由于认知负荷的限制，大规模实施此类推理具有挑战性，这凸显了对自动化解决方案的需求。虽然大型语言模型（LLMs）展现出应用潜力，但在此领域的探索仍属空白。为填补这一缺口，我们提出了创新性基准Fin-Force（金融前瞻反事实评估），通过精选金融新闻标题并提供结构化评估框架，支持基于LLM的前瞻性反事实生成。这为开发可扩展的自动化解决方案以探索和预测未来市场发展铺平了道路，从而为决策提供结构化洞见。通过在Fin-Force上的实验，我们评估了前沿LLM及反事实生成方法的性能，分析了其局限性，并为未来研究提出了建设性见解。

Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI

Abstract

arXiv:2505.19443v1 Announce Type: cross Abstract: This review presents a comprehensive analysis of two emerging paradigms in AI-assisted software development: vibe coding and agentic coding. While both leverage large language models (LLMs), they differ fundamentally in autonomy, architectural design, and the role of the developer. Vibe coding emphasizes intuitive, human-in-the-loop interaction through prompt-based, conversational workflows that support ideation, experimentation, and creative exploration. In contrast, agentic coding enables autonomous software development through goal-driven agents capable of planning, executing, testing, and iterating tasks with minimal human intervention. We propose a detailed taxonomy spanning conceptual foundations, execution models, feedback loops, safety mechanisms, debugging strategies, and real-world tool ecosystems. Through comparative workflow analysis and 20 detailed use cases, we illustrate how vibe systems thrive in early-stage prototyping and education, while agentic systems excel in enterprise-grade automation, codebase refactoring, and CI/CD integration. We further examine emerging trends in hybrid architectures, where natural language interfaces are coupled with autonomous execution pipelines. Finally, we articulate a future roadmap for agentic AI, outlining the infrastructure needed for trustworthy, explainable, and collaborative systems. Our findings suggest that successful AI software engineering will rely not on choosing one paradigm, but on harmonizing their strengths within a unified, human-centered development lifecycle.

摘要

本综述对AI辅助软件开发中的两大新兴范式——氛围编码与代理编码——进行了全面分析。尽管二者均依托大语言模型（LLMs），但在自主性、架构设计和开发者角色方面存在本质差异。氛围编码通过基于提示的对话式工作流，强调直觉性的人机协同交互，支持创意构思、实验验证和创造性探索；而代理编码则通过目标驱动的自主代理实现软件开发，这些代理能够以最小人工干预完成规划、执行、测试及迭代任务。我们提出了涵盖概念基础、执行模型、反馈机制、安全防护、调试策略及现实工具生态的详细分类体系。通过对比工作流分析和20个详细用例，阐明氛围系统在早期原型设计及教育领域表现突出，而代理系统更擅长企业级自动化、代码库重构及CI/CD集成。进一步探讨了混合架构的新兴趋势，即自然语言界面与自主执行管道的结合。最后提出了代理式AI的发展路线图，概述构建可信、可解释、可协作系统所需的基础设施。研究表明，成功的AI软件工程并非要选择单一范式，而是要在以人为本的统一开发生命周期中协调二者的优势。

VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation

Abstract

arXiv:2505.19395v1 Announce Type: cross Abstract: Ensuring that large language models (LLMs) can effectively assess, detect, explain, and remediate software vulnerabilities is critical for building robust and secure software systems. We introduce VADER, a human-evaluated benchmark designed explicitly to assess LLM performance across four key vulnerability-handling dimensions: assessment, detection, explanation, and remediation. VADER comprises 174 real-world software vulnerabilities, each carefully curated from GitHub repositories and annotated by security experts. For each vulnerability case, models are tasked with identifying the flaw, classifying it using Common Weakness Enumeration (CWE), explaining its underlying cause, proposing a patch, and formulating a test plan. Using a one-shot prompting strategy, we benchmark six state-of-the-art LLMs (Claude 3.7 Sonnet, Gemini 2.5 Pro, GPT-4.1, GPT-4.5, Grok 3 Beta, and o3) on VADER. Human security experts evaluated each response according to a standardized scoring rubric that emphasizes remediation (quality of the code fix, 50%), explanation (20%), and classification and test plan (30%). Our results show that current state-of-the-art LLMs achieve only moderate success on VADER: OpenAI's o3 attained 54.7% accuracy overall, with others in the 49-54% range, indicating ample room for improvement. Notably, remediation quality is strongly correlated (Pearson r > 0.97) with accurate classification and test plans, suggesting that models that effectively categorize vulnerabilities also tend to fix them well. VADER's comprehensive dataset, detailed evaluation rubrics, scoring tools, and visualized results with confidence intervals are publicly released, providing the community with an interpretable, reproducible benchmark to advance vulnerability-aware LLMs. All code and data are available at: https://github.com/AfterQuery/vader

摘要

确保大语言模型（LLMs）能够有效评估、检测、解释和修复软件漏洞，对于构建健壮且安全的软件系统至关重要。我们提出了VADER，这是一个经过人工评估的基准测试，专门用于评估LLMs在漏洞处理的四个关键维度上的表现：评估、检测、解释和修复。VADER包含174个真实世界的软件漏洞，每个漏洞均从GitHub仓库中精心筛选并由安全专家标注。对于每个漏洞案例，模型需要识别缺陷、使用通用缺陷枚举（CWE）进行分类、解释其根本原因、提出修复补丁并制定测试计划。通过一次性提示策略，我们对六种最先进的LLMs（Claude 3.7 Sonnet、Gemini 2.5 Pro、GPT-4.1、GPT-4.5、Grok 3 Beta和o3）在VADER上进行了基准测试，并由安全专家根据严格的评分标准对每个回答进行评估，重点关注修复（代码修复质量，50%）、解释（20%）以及分类和测试计划（30%）。我们的结果表明，当前最先进的LLMs在VADER上仅取得中等成功——OpenAI的o3总体准确率为54.7%，其他模型在49%-54%之间，表明仍有较大改进空间。值得注意的是，修复质量与准确的分类和测试计划呈强相关性（Pearson r > 0.97），这表明能够有效分类漏洞的模型也往往能较好地修复它们。VADER的完整数据集、详细评估标准、评分工具以及带有置信区间的可视化结果已公开发布，为社区提供了一个可解释、可复现的基准，以推动漏洞感知LLMs的发展。所有代码和数据均可在以下网址获取：https://github.com/AfterQuery/vader
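The rubric weights stated above (remediation 50%, explanation 20%, classification and test plan 30%) imply a simple weighted total per response. The sketch below shows that arithmetic only. The 0-to-1 score scale, the dictionary keys, and the function name are assumptions for illustration and do not describe the scoring tools in the VADER repository.

```python
# Weights are taken from the abstract; everything else here is hypothetical.
WEIGHTS = {
    "remediation": 0.5,
    "explanation": 0.2,
    "classification_and_test_plan": 0.3,
}

def weighted_score(scores):
    """Combine per-dimension scores (assumed to lie in [0, 1]) into one total."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

if __name__ == "__main__":
    # Hypothetical expert ratings for one model response.
    example = {
        "remediation": 0.6,
        "explanation": 0.8,
        "classification_and_test_plan": 0.5,
    }
    print(round(weighted_score(example), 3))
```

Because remediation carries half of the weight, a response with a weak patch cannot score highly even when its explanation is excellent.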

SIPDO: Closed-Loop Prompt Optimization via Synthetic Data Feedback

Abstract

arXiv:2505.19514v1 Announce Type: cross Abstract: Prompt quality plays a critical role in the performance of large language models (LLMs), motivating a growing body of work on prompt optimization. Most existing methods optimize prompts over a fixed dataset, assuming static input distributions and offering limited support for iterative improvement. We introduce SIPDO (Self-Improving Prompts through Data-Augmented Optimization), a closed-loop framework for prompt learning that integrates synthetic data generation into the optimization process. SIPDO couples a synthetic data generator with a prompt optimizer, where the generator produces new examples that reveal current prompt weaknesses and the optimizer incrementally refines the prompt in response. This feedback-driven loop enables systematic improvement of prompt performance without assuming access to external supervision or new tasks. Experiments across question answering and reasoning benchmarks show that SIPDO outperforms standard prompt tuning methods, highlighting the value of integrating data synthesis into prompt learning workflows.

摘要

提示词质量对大型语言模型（LLMs）的性能具有决定性影响，这促使了越来越多关于提示优化的研究。现有方法大多基于固定数据集进行提示优化，假设输入分布静态且缺乏对迭代改进的支持。我们提出SIPDO（基于数据增强优化的自改进提示框架），这是一种将合成数据生成整合至优化过程的闭环提示学习框架。SIPDO将合成数据生成器与提示优化器耦合：生成器通过暴露当前提示缺陷产生新样本，优化器则据此逐步优化提示。这种反馈驱动机制能在不依赖外部监督或新任务的前提下，实现提示性能的系统性提升。在问答和推理基准测试中的实验表明，SIPDO优于标准提示调优方法，验证了将数据合成融入提示学习工作流程的价值。

Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs

Abstract

arXiv:2505.19481v1 Announce Type: cross Abstract: Large language models (LLMs) have shown remarkable performance across diverse reasoning and generation tasks, and are increasingly deployed as agents in dynamic environments such as code generation and recommendation systems. However, many real-world applications, such as high-frequency trading and real-time competitive gaming, require decisions under strict latency constraints, where faster responses directly translate into higher rewards. Despite the importance of this latency-quality trade-off, it remains underexplored in the context of LLM-based agents. In this work, we present the first systematic study of this trade-off in real-time decision-making tasks. To support our investigation, we introduce two new benchmarks: HFTBench, a high-frequency trading simulation, and StreetFighter, a competitive gaming platform. Our analysis reveals that the optimal latency-quality balance varies by task, and that sacrificing quality for lower latency can significantly enhance downstream performance. To address this, we propose FPX, an adaptive framework that dynamically selects model size and quantization level based on real-time demands. Our method achieves the best performance on both benchmarks, improving win rate by up to 80% in StreetFighter and boosting daily yield by up to 26.52% in trading. These results demonstrate the critical importance of latency-aware evaluation and deployment strategies for real-world LLM-based agents. Our benchmarks are available at Latency Sensitive Benchmarks.

摘要

大语言模型（LLMs）在多样化的推理与生成任务中展现出卓越性能，并日益作为智能体部署于代码生成和推荐系统等动态环境中。然而，高频交易和实时竞技游戏等现实应用场景需要严格延迟约束下的决策能力，其中更快的响应速度直接转化为更高收益。尽管这种延迟与质量的权衡至关重要，但在基于LLM的智能体研究中仍未被充分探索。本研究首次对实时决策任务中的这种权衡进行了系统性分析。为支持研究，我们引入两个新基准：HFTBench（高频交易模拟器）和StreetFighter（竞技游戏平台）。分析表明，最优延迟-质量平衡因任务而异，而牺牲质量换取更低延迟能显著提升下游性能。为此，我们提出FPX框架——通过实时需求动态选择模型规模和量化级别的自适应系统。该方法在两个基准测试中均取得最佳表现：在《街头霸王》中获胜率最高提升80%，在交易场景中日收益率最高提升26.52%，这凸显了基于LLM的智能体需要延迟感知的评估与部署策略。研究结果证实了延迟敏感评估框架对现实世界LLM智能体的关键价值。相关基准测试已发布于Latency Sensitive Benchmarks平台。

CODE-DITING: A Reasoning-Based Metric for Functional Alignment in Code Evaluation

Abstract

arXiv:2505.19502v1 Announce Type: cross Abstract: Trustworthy evaluation methods for code snippets play a crucial role in neural code generation. Traditional methods, which either rely on reference solutions or require executable test cases, have inherent limitations in flexibility and scalability. The recent LLM-as-Judge methodology offers a promising alternative by directly evaluating functional consistency between the problem description and the generated code. To systematically understand the landscape of these LLM-as-Judge methods, we conduct a comprehensive empirical study across three diverse datasets. Our investigation reveals the pros and cons of two categories of LLM-as-Judge methods: the methods based on general foundation models can achieve good performance but require complex prompts and lack explainability, while the methods based on reasoning foundation models provide better explainability with simpler prompts but demand substantial computational resources due to their large parameter sizes. To address these limitations, we propose CODE-DITING, a novel code evaluation method that balances accuracy, efficiency and explainability. We develop a data distillation framework that effectively transfers reasoning capabilities from DeepSeek-R1 671B to our CODE-DITING 1.5B and 7B models, significantly enhancing evaluation explainability and reducing the computational cost. With the majority vote strategy in the inference process, CODE-DITING 1.5B outperforms all models of the same parameter magnitude and achieves performance normally seen in models with five times as many parameters. CODE-DITING 7B surpasses GPT-4o and DeepSeek-V3 671B, even though it only uses 1% of the parameter volume of these large models. Further experiments show that CODE-DITING is robust to preference leakage and can serve as a promising alternative for code evaluation.

摘要

代码片段的可信评估方法在神经代码生成中起着关键作用。传统方法要么依赖参考解决方案，要么需要可执行测试用例，在灵活性和可扩展性方面存在固有局限。新兴的LLM-as-Judge方法通过直接评估问题描述与生成代码之间的功能一致性，提供了有前景的替代方案。为系统理解这类方法的现状，我们在三个不同数据集上开展了全面实证研究。研究发现两类LLM-as-Judge方法的优缺点：基于通用基础模型的方法虽能取得良好性能，但需要复杂提示且缺乏可解释性；而基于推理基础模型的方法通过简单提示即可提供更好可解释性，但由于参数量庞大需要大量计算资源。针对这些局限，我们提出CODE-DITING这一新型代码评估方法，在准确性、效率和可解释性之间实现平衡。我们开发的数据蒸馏框架有效将DeepSeek-R1671B的推理能力迁移至CODE-DITING 1.5B和7B模型，显著提升评估可解释性并降低计算成本。通过推理过程中的多数投票策略，CODE-DITING 1.5B在同等参数规模模型中表现最优，达到通常需要5倍参数规模才能实现的性能。CODE-DITING 7B虽仅使用这些大模型1%的参数体量，却超越了GPT-4o和DeepSeek-V3 671B。进一步实验表明CODEDITING对偏好泄露具有鲁棒性，可作为代码评估的理想替代方案。
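The majority vote strategy mentioned above can be sketched as follows: sample several verdicts from the judge for the same (problem, code) pair and keep the most frequent one. This is a generic illustration, not the CODE-DITING implementation. The verdict labels and the tie-breaking rule (the label seen first wins) are assumptions.

```python
from collections import Counter

def majority_vote(verdicts):
    """Return the most frequent verdict among several sampled judgments.

    Ties are broken in favor of the label that appears first in `verdicts`,
    which follows from Counter preserving insertion order.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    label, _count = Counter(verdicts).most_common(1)[0]
    return label

if __name__ == "__main__":
    # Hypothetical verdicts from five sampled judge outputs.
    sampled = ["correct", "incorrect", "correct", "correct", "incorrect"]
    print(majority_vote(sampled))
```

Using an odd number of samples avoids ties for a binary verdict, at the cost of proportionally more inference calls.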

DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation

Abstract

arXiv:2505.19504v1 Announce Type: cross Abstract: Large Language Models (LLMs) represent substantial intellectual and economic investments, yet their effectiveness can inadvertently facilitate model imitation via knowledge distillation (KD). In practical scenarios, competitors can distill proprietary LLM capabilities by simply observing publicly accessible outputs, akin to reverse-engineering a complex performance by observation alone. Existing protective methods like watermarking only identify imitation post-hoc, while other defenses assume the student model mimics the teacher's internal logits, rendering them ineffective against distillation purely from observed output text. This paper confronts the challenge of actively protecting LLMs within the realistic constraints of API-based access. We introduce an effective and efficient Defensive Output Generation (DOGe) strategy that subtly modifies the output behavior of an LLM. Its outputs remain accurate and useful for legitimate users, yet are designed to be misleading for distillation, significantly undermining imitation attempts. We achieve this by fine-tuning only the final linear layer of the teacher LLM with an adversarial loss. This targeted training approach anticipates and disrupts distillation attempts during inference time. Our experiments show that, while preserving or even improving the original performance of the teacher model, student models distilled from the defensively generated teacher outputs demonstrate catastrophically reduced performance, demonstrating our method's effectiveness as a practical safeguard against KD-based model imitation.

摘要

大型语言模型（LLMs）作为重大智力与经济投入的成果，其高效性可能无意中通过知识蒸馏（KD）促进模型模仿。在实际场景中，竞争对手仅需观察公开可获取的输出即可蒸馏专有LLM的能力，这类似于仅通过观察来逆向工程复杂表演。现有保护方法（如数字水印）仅能事后识别模仿行为，而其他防御措施则假设学生模型会复制教师模型的内部逻辑值，导致这些方法对纯基于输出文本的蒸馏完全无效。本文针对基于API访问的现实约束条件下主动保护LLMs的挑战，提出了一种高效防御性输出生成（DOGe）策略。该策略通过微妙调整LLM的输出行为，在保证合法用户获得准确有用结果的同时，使输出内容对蒸馏过程具有误导性，从而显著破坏模仿尝试。我们仅通过对抗性损失微调教师LLM的最终线性层实现这一目标，这种针对性训练方法能在推理阶段预判并干扰蒸馏尝试。实验表明：在保持甚至提升教师模型原始性能的同时，从防御性生成的教师输出中蒸馏得到的学生模型性能出现灾难性下降，这证明我们的方法能有效防范基于KD的模型模仿。

Hierarchical Tree Search-based User Lifelong Behavior Modeling on Large Language Model

Abstract

arXiv:2505.19505v1 Announce Type: cross Abstract: Large Language Models (LLMs) have garnered significant attention in Recommendation Systems (RS) due to their extensive world knowledge and robust reasoning capabilities. However, a critical challenge lies in enabling LLMs to effectively comprehend and extract insights from massive user behaviors. Current approaches that directly leverage LLMs for user interest learning face limitations in handling long sequential behaviors, effectively extracting interest, and applying interest in practical scenarios. To address these issues, we propose a Hierarchical Tree Search-based User Lifelong Behavior Modeling framework (HiT-LBM). HiT-LBM integrates Chunked User Behavior Extraction (CUBE) and Hierarchical Tree Search for Interest (HTS) to capture the diverse interests and interest evolution of users. CUBE divides user lifelong behaviors into multiple chunks and learns the interest and interest evolution within each chunk in a cascading manner. HTS generates candidate interests through hierarchical expansion and searches for the optimal interest with a process rating model to ensure information gain for each behavior chunk. Additionally, we design Temporal-Aware Interest Fusion (TIF) to integrate interests from multiple behavior chunks, constructing a comprehensive representation of user lifelong interests. The representation can be embedded into any recommendation model to enhance performance. Extensive experiments demonstrate the effectiveness of our approach, showing that it surpasses state-of-the-art methods.

摘要

大型语言模型（LLMs）凭借其丰富的世界知识和强大的推理能力，在推荐系统（RS）领域获得了广泛关注。然而，如何使LLMs有效理解并提取海量用户行为中的洞察仍面临关键挑战。现有直接利用LLMs进行用户兴趣学习的方法在处理长序列行为、有效提取兴趣及实际场景应用方面存在局限。为此，我们提出基于层次化树搜索的用户终身行为建模框架（HiT-LBM）。该框架通过分块用户行为提取（CUBE）和层次化兴趣树搜索（HTS）来捕捉用户多样化兴趣及其演化过程。CUBE将用户终身行为划分为多个区块，以级联方式学习每个区块内的兴趣及兴趣演化。HTS通过层次化扩展生成候选兴趣，并利用过程评分模型搜索最优兴趣，确保每个行为区块的信息增益。此外，我们设计时序感知兴趣融合模块（TIF）来整合多行为区块的兴趣，构建用户终身兴趣的完整表征。该表征可嵌入任意推荐模型以提升性能。大量实验证明本方法的有效性，其表现优于当前最先进方法。

Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models

Abstract

arXiv:2505.19509v1 Announce Type: cross Abstract: Large Multimodal Models (LMMs) face notable challenges when encountering multimodal knowledge conflicts, particularly under retrieval-augmented generation (RAG) frameworks where the contextual information from external sources may contradict the model's internal parametric knowledge, leading to unreliable outputs. However, existing benchmarks fail to reflect such realistic conflict scenarios. Most focus solely on intra-memory conflicts, while context-memory and inter-context conflicts remain largely under-investigated. Furthermore, commonly used factual knowledge-based evaluations are often overlooked, and existing datasets lack a thorough investigation into conflict detection capabilities. To bridge this gap, we propose MMKC-Bench, a benchmark designed to evaluate factual knowledge conflicts in both context-memory and inter-context scenarios. MMKC-Bench encompasses three types of multimodal knowledge conflicts and includes 1,573 knowledge instances and 3,381 images across 23 broad types, collected through automated pipelines with human verification. We evaluate three representative series of LMMs on both model behavior analysis and conflict detection tasks. Our findings show that while current LMMs are capable of recognizing knowledge conflicts, they tend to favor internal parametric knowledge over external evidence. We hope MMKC-Bench will foster further research in multimodal knowledge conflict and enhance the development of multimodal RAG systems. The source code is available at https://github.com/MLLMKCBENCH/MLLMKC.

摘要

大型多模态模型(LMMs)在面临多模态知识冲突时存在显著挑战，特别是在检索增强生成(RAG)框架下，外部来源的上下文信息可能与模型内部参数化知识相矛盾，导致输出结果不可靠。然而现有基准测试未能反映此类现实冲突场景：多数研究仅关注内部记忆冲突，而上下文-记忆冲突与跨上下文冲突领域仍缺乏深入探究。此外，基于事实知识的评估方法常被忽视，现有数据集对冲突检测能力的考察也不够全面。为填补这一空白，我们提出MMKC-Bench基准测试，专门用于评估上下文-记忆和跨上下文场景中的事实知识冲突。该基准涵盖三类多模态知识冲突，包含通过自动化流程采集并经人工校验的1,573个知识实例和3,381张图像，涉及23个广泛类别。我们对三个代表性LMM系列进行了模型行为分析和冲突检测任务评估。研究发现，尽管当前LMMs能够识别知识冲突，但往往更倾向于依赖内部参数化知识而非外部证据。期望MMKC-Bench能促进多模态知识冲突研究的深入，并推动多模态RAG系统的发展。

DocMEdit: Towards Document-Level Model Editing

Abstract

arXiv:2505.19572v1 Announce Type: cross Abstract: Model editing aims to correct errors and outdated knowledge in large language models (LLMs) at minimal cost. Prior research has proposed a variety of datasets to assess the effectiveness of these model editing methods. However, most existing datasets only require models to output short phrases or sentences, overlooking the widespread existence of document-level tasks in the real world and raising doubts about their practical usability. To address this limitation and promote the application of model editing in real-world scenarios, we propose the task of document-level model editing. To tackle such challenges and enhance model capabilities in practical settings, we introduce DocMEdit, a dataset focused on document-level model editing, characterized by document-level inputs and outputs, extrapolation, and multiple facts within a single edit. We propose a series of evaluation metrics and experiments. The results show that the difficulties in document-level model editing pose challenges for existing model editing methods.

摘要

模型编辑旨在以最小成本修正大语言模型（LLMs）中的错误和过时知识。先前研究提出了多种数据集以评估这些模型编辑方法的有效性。然而，现有数据集大多仅要求模型输出短短语或句子，忽视了现实世界中广泛存在的文档级任务，这引发了对其实际适用性的质疑。为解决这一局限并推动模型编辑在现实场景中的应用，我们提出了文档级模型编辑任务。为应对此类挑战并增强模型在实际环境中的能力，我们引入了DocMEdit数据集，该数据集专注于文档级模型编辑，其特点包括文档级输入输出、外推性以及单次编辑中包含多重事实。我们提出了一系列评估指标和实验，结果表明文档级模型编辑的难度对现有模型编辑方法构成了挑战。

How Syntax Specialization Emerges in Language Models

Abstract

arXiv:2505.19548v1 Announce Type: cross Abstract: Large language models (LLMs) have been found to develop surprising internal specializations: individual neurons, attention heads, and circuits become selectively sensitive to syntactic structure, reflecting patterns observed in the human brain. While this specialization is well-documented, how it emerges during training and what influences its development remain largely unknown. In this work, we tap into the black box of specialization by tracking its formation over time. By quantifying internal syntactic consistency across minimal pairs from various syntactic phenomena, we identify a clear developmental trajectory: syntactic sensitivity emerges gradually, concentrates in specific layers, and exhibits a 'critical period' of rapid internal specialization. This process is consistent across architectures and initialization parameters (e.g., random seeds), and is influenced by model scale and training data. We therefore reveal not only where syntax arises in LLMs but also how some models internalize it during training. To support future research, we will release the code, models, and training checkpoints upon acceptance.

摘要

研究发现大型语言模型（LLMs）会形成令人惊奇的内部专化现象：单个神经元、注意力头和电路会选择性对句法结构产生敏感反应，这种现象与人类大脑中观察到的模式相呼应。尽管这种专化已有充分记录，但其在训练过程中如何形成以及受哪些因素影响仍属未知领域。本研究通过追踪专化现象的时序形成过程，揭示了其黑箱机制。通过量化不同句法现象最小对比对中的内部句法一致性，我们发现了明确的发展轨迹：句法敏感性逐步显现，集中分布于特定层级，并呈现出一个快速内部专化的"关键期"。该过程在不同架构和初始化参数（如随机种子）中表现一致，同时受模型规模与训练数据影响。因此，我们不仅揭示了句法在LLMs中的形成位置，还阐明了部分模型在训练过程中内化句法的机制。为支持后续研究，我们将在论文录用后公开相关代码、模型及训练检查点。

Abstract

arXiv:2505.19578v1 Announce Type: cross Abstract: Sparse attention methods exploit the inherent sparsity in attention to speed up the prefilling phase of long-context inference, mitigating the quadratic complexity of full attention computation. While existing sparse attention methods rely on predefined patterns or inaccurate estimations to approximate attention behavior, they often fail to fully capture the true dynamics of attention, resulting in reduced efficiency and compromised accuracy. Instead, we propose a highly accurate sparse attention mechanism that shares similar yet precise attention patterns across heads, enabling a more realistic capture of the dynamic behavior of attention. Our approach is grounded in two key observations: (1) attention patterns demonstrate strong inter-head similarity, and (2) this similarity remains remarkably consistent across diverse inputs. By strategically sharing computed accurate patterns across attention heads, our method effectively captures actual patterns while requiring full attention computation for only a small subset of heads. Comprehensive evaluations demonstrate that our approach achieves superior or comparable speedup relative to state-of-the-art methods while delivering the best overall accuracy.

摘要

稀疏注意力方法利用注意力机制固有的稀疏性来加速长上下文推理的预填充阶段，从而缓解全注意力计算的二次复杂度问题。现有稀疏注意力方法依赖预定义模式或不精确估计来近似注意力行为，往往无法完整捕捉注意力的真实动态，导致效率降低和准确性受损。我们提出了一种高精度稀疏注意力机制，通过在注意力头间共享相似但精确的注意力模式，更真实地捕捉注意力的动态行为。该方法基于两个关键发现：(1) 注意力模式表现出强烈的头间相似性；(2) 这种相似性在不同输入间保持高度一致。通过策略性地在注意力头间共享计算得到的精确模式，我们的方法能有效捕获实际模式，同时仅需对少量头进行全注意力计算。综合评估表明，相较于最先进方法，本方案在取得相当或更优加速比的同时，提供了最佳的整体准确性。

Multi-Agent Collaboration via Evolving Orchestration

Abstract

arXiv:2505.19591v1 Announce Type: cross Abstract: Large language models (LLMs) have achieved remarkable results across diverse downstream tasks, but their monolithic nature restricts scalability and efficiency in complex problem-solving. While recent research explores multi-agent collaboration among LLMs, most approaches rely on static organizational structures that struggle to adapt as task complexity and agent numbers grow, resulting in coordination overhead and inefficiencies. To this end, we propose a puppeteer-style paradigm for LLM-based multi-agent collaboration, where a centralized orchestrator ("puppeteer") dynamically directs agents ("puppets") in response to evolving task states. This orchestrator is trained via reinforcement learning to adaptively sequence and prioritize agents, enabling flexible and evolvable collective reasoning. Experiments on closed- and open-domain scenarios show that this method achieves superior performance with reduced computational costs. Analyses further reveal that the key improvements consistently stem from the emergence of more compact, cyclic reasoning structures under the orchestrator's evolution.

摘要

大语言模型（LLMs）在各类下游任务中取得了显著成果，但其单一性限制了复杂问题解决中的可扩展性和效率。尽管近期研究探索了LLMs间的多智能体协作，但多数方法依赖于静态组织结构，难以随任务复杂度和智能体数量增长而自适应调整，导致协调开销与效率低下。为此，我们提出一种基于LLM的木偶式多智能体协作范式，其中中央协调器（"操纵者"）能根据动态任务状态实时调度智能体（"木偶"）。该协调器通过强化学习训练，可自适应地排序和优先调用智能体，实现灵活可进化的集体推理。在封闭域和开放域场景中的实验表明，该方法能以更低计算成本获得更优性能。分析进一步揭示，关键改进始终源于协调器演化过程中涌现出的更紧凑、循环式推理结构。

FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models

Abstract

arXiv:2505.19536v1 Announce Type: cross Abstract: Large vision-language models (LVLMs) excel at multimodal understanding but suffer from high computational costs due to redundant vision tokens. Existing pruning methods typically rely on single-layer attention scores to rank and prune redundant visual tokens to solve this inefficiency. However, as the interaction between tokens and layers is complicated, this raises a basic question: Is such a simple single-layer criterion sufficient to identify redundancy? To answer this question, we rethink the emergence of redundant visual tokens from a fundamental perspective: information flow, which models the interaction between tokens and layers by capturing how information moves between tokens across layers. We find (1) the CLS token acts as an information relay, which can simplify the complicated flow analysis; (2) the redundancy emerges progressively and dynamically via layer-wise attention concentration; and (3) relying solely on attention scores from single layers can lead to contradictory redundancy identification. Based on this, we propose FlowCut, an information-flow-aware pruning framework, mitigating the insufficiency of the current criterion for identifying redundant tokens and better aligning with the model's inherent behaviors. Extensive experiments show that FlowCut achieves superior results, outperforming SoTA by 1.6% on LLaVA-1.5-7B with 88.9% token reduction, and by 4.3% on LLaVA-NeXT-7B with 94.4% reduction, delivering 3.2x speed-up in the prefilling stage. Our code is available at https://github.com/TungChintao/FlowCut

摘要

大型视觉语言模型（LVLMs）在多模态理解方面表现卓越，但由于冗余的视觉标记导致计算成本高昂。现有的剪枝方法通常依赖单层注意力分数来排序和剪枝冗余视觉标记以解决这一效率问题。然而，由于标记与层之间的交互复杂，这引发了一个基本问题：如此简单的单层标准是否足以识别冗余？为回答这一问题，我们从信息流这一基础视角重新思考冗余视觉标记的产生：信息流通过捕捉标记在层间的信息传递方式，建模了标记与层之间的交互。我们发现（1）CLS标记作为信息中继，可简化复杂的信息流分析；（2）冗余通过逐层注意力集中而动态渐进地显现；（3）仅依赖单层注意力分数可能导致矛盾的冗余识别。基于此，我们提出FlowCut，一种信息流感知的剪枝框架，缓解当前标准在识别冗余标记上的不足，并更好地与模型的固有行为对齐。大量实验表明，FlowCut取得了优异的结果，在LLaVA-1.5-7B上以88.9%的标记削减率优于现有最佳方法1.6%，在LLaVA-NeXT-7B上以94.4%的削减率领先4.3%，并在预填充阶段实现了3.2倍的加速。我们的代码发布于https://github.com/TungChintao/FlowCut。

Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar

Abstract

arXiv:2505.19599v1 Announce Type: cross Abstract: Typical methods for evaluating the performance of language models evaluate their ability to answer questions accurately. These evaluation metrics are acceptable for determining the extent to which language models can understand and reason about text in a general sense, but fail to capture nuanced capabilities, such as the ability of language models to recognize and obey rare grammar points, particularly in languages other than English. We measure the perplexity of language models when confronted with the "first person psych predicate restriction" grammar point in Japanese. Weblab is the only tested open source model in the 7-10B parameter range which consistently assigns higher perplexity to ungrammatical psych predicate sentences than grammatical ones. We give evidence that Weblab's uniformly bad tokenization is a possible root cause for its good performance, and show that Llama 3's perplexity on grammatical psych predicate sentences can be reduced by orders of magnitude (28x difference) by restricting test sentences to those with uniformly well-behaved tokenizations. We show in further experiments on machine translation tasks that language models will use alternative grammar patterns in order to produce grammatical sentences when tokenization issues prevent the most natural sentence from being output.

摘要

评估语言模型性能的典型方法主要考察其准确回答问题的能力。这类评估指标虽能总体衡量语言模型对文本的理解与推理水平，却难以捕捉细微能力差异，例如模型对罕见语法点（尤其是非英语语言）的识别与遵循能力。本研究通过测量语言模型面对日语"第一人称心理谓词限制"语法点时的困惑度，发现Weblab是7-10B参数范围内唯一始终对不合语法心理谓词句赋予更高困惑度的开源模型。证据表明，Weblab表现优异可能源于其统一的低质量分词处理，实验证明通过限制测试句为分词表现一致的句子，Llama 3对合语法心理谓词句的困惑度可降低28倍量级。在机器翻译任务的进一步实验中，我们发现当分词问题阻碍最自然句式输出时，语言模型会转而采用替代语法模式以生成合语法句子。
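
Illustrative note: the abstract above compares models by the perplexity they assign to grammatical and ungrammatical sentences. A minimal sketch of that quantity, assuming per-token natural-log probabilities are already available from some model, could look like the following. The function name and the example values are hypothetical and are not taken from the paper.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one sentence from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Perplexity is the exponential of the mean negative log-likelihood.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical log-probabilities, for illustration only (not measured values).
grammatical = [math.log(0.5), math.log(0.25)]
ungrammatical = [math.log(0.1), math.log(0.1)]

# A model that respects the grammar point should give the ungrammatical
# sentence the higher perplexity.
print(perplexity(grammatical) < perplexity(ungrammatical))
```

Because the mean is taken over tokens, the same sentence can receive very different perplexities under different tokenizations, which is the sensitivity the abstract investigates.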

Decoupling Spatio-Temporal Prediction: When Lightweight Large Models Meet Adaptive Hypergraphs

Abstract

arXiv:2505.19620v1 Announce Type: cross Abstract: Spatio-temporal prediction is a pivotal task with broad applications in traffic management, climate monitoring, energy scheduling, etc. However, existing methodologies often struggle to balance model expressiveness and computational efficiency, especially when scaling to large real-world datasets. To tackle these challenges, we propose STH-SepNet (Spatio-Temporal Hypergraph Separation Networks), a novel framework that decouples temporal and spatial modeling to enhance both efficiency and precision. Therein, the temporal dimension is modeled using lightweight large language models, which effectively capture low-rank temporal dynamics. Concurrently, the spatial dimension is addressed through an adaptive hypergraph neural network, which dynamically constructs hyperedges to model intricate, higher-order interactions. A carefully designed gating mechanism is integrated to seamlessly fuse temporal and spatial representations. By leveraging the fundamental principles of low-rank temporal dynamics and spatial interactions, STH-SepNet offers a pragmatic and scalable solution for spatio-temporal prediction in real-world applications. Extensive experiments on large-scale real-world datasets across multiple benchmarks demonstrate the effectiveness of STH-SepNet in boosting predictive performance while maintaining computational efficiency. This work may provide a promising lightweight framework for spatio-temporal prediction, aiming to reduce computational demands while enhancing predictive performance. Our code is available at https://github.com/SEU-WENJIA/ST-SepNet-Lightweight-LLMs-Meet-Adaptive-Hypergraphs.

摘要

时空预测是交通管理、气候监测、能源调度等领域的关键任务。然而现有方法在模型表达能力与计算效率的平衡上存在不足，尤其难以适应大规模现实数据集。为此，我们提出STH-SepNet（时空超图分离网络），通过解耦时空建模来提升效率与精度。该框架采用轻量级大语言模型捕捉低秩时间动态，同时通过自适应超图神经网络动态构建超边以建模复杂高阶空间交互，并设计门控机制实现时空表征的有机融合。基于低秩时间动态与空间交互的基本原理，STH-SepNet为实际应用提供了高效可扩展的时空预测方案。在多基准的大规模现实数据集实验中，该方法在保持计算效率的同时显著提升了预测性能。本研究为时空预测提供了一个有望降低计算成本并提升预测性能的轻量级框架。代码已开源：https://github.com/SEU-WENJIA/ST-SepNet-Lightweight-LLMs-Meet-Adaptive-Hypergraphs。

Skrull: Towards Efficient Long Context Fine-tuning through Dynamic Data Scheduling

Abstract

arXiv:2505.19609v1 Announce Type: cross Abstract: Long-context supervised fine-tuning (Long-SFT) plays a vital role in enhancing the performance of large language models (LLMs) on long-context tasks. To smoothly adapt LLMs to long-context scenarios, this process typically entails training on mixed datasets containing both long and short sequences. However, this heterogeneous sequence length distribution poses significant challenges for existing training systems, as they fail to simultaneously achieve high training efficiency for both long and short sequences, resulting in sub-optimal end-to-end system performance in Long-SFT. In this paper, we present a novel perspective on data scheduling to address the challenges posed by the heterogeneous data distributions in Long-SFT. We propose Skrull, a dynamic data scheduler specifically designed for efficient Long-SFT. Through dynamic data scheduling, Skrull balances the computation requirements of long and short sequences, improving overall training efficiency. Furthermore, we formulate the scheduling process as a joint optimization problem and thoroughly analyze the trade-offs involved. Based on this analysis, Skrull employs a lightweight scheduling algorithm to achieve near-zero cost online scheduling in Long-SFT. Finally, we implement Skrull upon DeepSpeed, a state-of-the-art distributed training system for LLMs. Experimental results demonstrate that Skrull outperforms DeepSpeed by 3.76x on average (up to 7.54x) in real-world Long-SFT scenarios.

摘要

长上下文监督微调（Long-SFT）对于提升大语言模型（LLM）在长上下文任务中的表现至关重要。为使LLM顺利适应长上下文场景，该过程通常需要在包含长短序列的混合数据集上进行训练。然而，这种异构序列长度分布对现有训练系统提出了重大挑战，因为它们无法同时实现长短序列的高训练效率，导致Long-SFT的端到端系统性能欠佳。本文提出一种新颖的数据调度视角，以解决Long-SFT中异构数据分布带来的挑战。我们设计了Skrull——一个专为高效长上下文微调而设计的动态数据调度器。通过动态数据调度，Skrull平衡了长短序列的计算需求，从而提升整体训练效率。此外，我们将调度过程建模为联合优化问题，并深入分析其中的权衡关系。基于这些分析，Skrull采用轻量级调度算法，在Long-SFT中实现近乎零成本的在线调度。最后，我们在最先进的LLM分布式训练系统DeepSpeed上实现了Skrull。实验结果表明，在实际长上下文微调场景中，Skrull平均性能超越DeepSpeed 3.76倍（最高达7.54倍）。

Preference Optimization by Estimating the Ratio of the Data Distribution

Abstract

arXiv:2505.19601v1 Announce Type: cross Abstract: Direct preference optimization (DPO) is widely used as a simple and stable method for aligning large language models (LLMs) with human preferences. This paper investigates a generalized DPO loss that enables a policy model to match the target policy from a likelihood ratio estimation perspective. The ratio of the target policy provides a unique identification of the policy distribution without relying on reward models or partition functions. This allows the generalized loss to retain both simplicity and theoretical guarantees, which prior work such as $f$-PO fails to achieve simultaneously. We propose Bregman preference optimization (BPO), a generalized framework for ratio matching that provides a family of objective functions achieving target policy optimality. BPO subsumes DPO as a special case and offers tractable forms for all instances, allowing implementation with a few lines of code. We further develop scaled Basu's power divergence (SBA), a gradient scaling method that can be used for BPO instances. The BPO framework complements other DPO variants and is applicable to target policies defined by these variants. In experiments, unlike other probabilistic loss extensions such as $f$-DPO or $f$-PO, which exhibit a trade-off between generation fidelity and diversity, instances of BPO improve both win rate and entropy compared with DPO. When applied to Llama-3-Instruct-8B, BPO achieves state-of-the-art performance among Llama-3-8B backbones, with a 55.9% length-controlled win rate on AlpacaEval2.

摘要

直接偏好优化（DPO）作为一种简单稳定的方法，被广泛用于将大语言模型（LLM）与人类偏好对齐。本文从似然比估计的角度出发，研究了一种广义DPO损失函数，使策略模型能够匹配目标策略。目标策略的比率提供了策略分布的唯一标识，无需依赖奖励模型或配分函数。这使得广义损失既能保持简洁性，又具有理论保证，而先前工作如 $f$-PO 无法同时实现这两点。我们提出Bregman偏好优化（BPO），这是一个用于比率匹配的广义框架，提供了一组实现目标策略最优性的目标函数族。BPO将DPO作为特例包含其中，并为所有实例提供了易处理的形式，仅需几行代码即可实现。我们进一步开发了缩放Basu幂散度（SBA），这是一种可用于BPO实例的梯度缩放方法。BPO框架与其他DPO变体互补，并适用于由这些变体定义的目标策略。实验表明，与 $f$-DPO 或 $f$-PO 等概率损失扩展不同（这些方法在生成保真度与多样性之间存在权衡），BPO实例在胜率和熵两方面均优于DPO。当应用于Llama-3-Instruct-8B时，BPO在Llama-3-8B骨干模型中实现了最先进的性能，在AlpacaEval2上达到55.9%的长度控制胜率。
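
Illustrative note: the abstract states that BPO subsumes DPO as a special case. The sketch below shows only that special case, the standard per-pair DPO loss, and says nothing about the other BPO instances or SBA. It assumes sequence-level log-probabilities under the policy and a frozen reference model are already computed; the argument names and the default `beta` are hypothetical choices for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Inputs are summed token log-probabilities of the chosen and rejected
    responses under the policy and under the reference model.
    """
    # Scaled difference of log-ratios between the chosen and rejected responses.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is zero and the loss is log 2.
print(dpo_loss(-3.0, -4.0, -3.0, -4.0))
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference more than it raises the rejected one's.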

Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models

Abstract

arXiv:2505.19616v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across tasks, yet they often exhibit difficulty in distinguishing task-relevant from irrelevant signals, particularly in tasks like Visual Question Answering (VQA), which can lead to susceptibility to misleading or spurious inputs. We refer to this broader limitation as the Cross-Modality Competency Problem: the model's inability to fairly evaluate all modalities. This vulnerability becomes more evident in modality-specific tasks such as image classification or pure text question answering, where models are expected to rely solely on one modality. In such tasks, spurious information from irrelevant modalities often leads to significant performance degradation. We refer to this failure as Modality Interference, which serves as a concrete and measurable instance of the cross-modality competency problem. We further design a perturbation-based causal diagnostic experiment to verify and quantify this problem. To mitigate modality interference, we propose a novel framework to fine-tune MLLMs, including perturbation-based data augmentations with both heuristic perturbations and adversarial perturbations via Projected Gradient Descent (PGD), and a consistency regularization strategy applied to model outputs with original and perturbed inputs. Experiments on multiple benchmark datasets (image-heavy, text-heavy, and VQA tasks) and multiple model families with different scales demonstrate significant improvements in robustness and cross-modality competency, indicating our method's effectiveness in boosting unimodal reasoning ability while enhancing performance on multimodal tasks.

摘要

多模态大语言模型（MLLMs）已在各类任务中展现出卓越能力，但其常难以区分任务相关与无关信号，尤其在视觉问答（VQA）等任务中易受误导性或伪相关输入的干扰。我们将这一广义局限称为跨模态能力问题：模型无法公平评估所有模态。该缺陷在图像分类或纯文本问答等单模态任务中更为显著——此类任务本需模型仅依赖单一模态，而无关模态的干扰信息常导致性能显著下降。我们将此失效现象定义为模态干扰，其作为跨模态能力问题的具体可量化实例。我们进一步设计了基于扰动的因果诊断实验以验证和量化该问题。为缓解模态干扰，提出新型MLLMs微调框架：包含基于投影梯度下降（PGD）的启发式扰动与对抗扰动的数据增强方法，以及对原始输入与扰动输入采用输出一致性正则化策略。在多个基准数据集（图像主导型、文本主导型及VQA任务）和不同规模模型族上的实验表明，该方法能显著提升模型鲁棒性与跨模态能力，证实其既可增强单模态推理能力，又能提升多模态任务性能的有效性。

AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems

Abstract

arXiv:2505.19623v1 Announce Type: cross Abstract: The emergence of agentic recommender systems powered by Large Language Models (LLMs) represents a paradigm shift in personalized recommendations, leveraging LLMs' advanced reasoning and role-playing capabilities to enable autonomous, adaptive decision-making. Unlike traditional recommendation approaches, agentic recommender systems can dynamically gather and interpret user-item interactions from complex environments, generating robust recommendation strategies that generalize across diverse scenarios. However, the field currently lacks standardized evaluation protocols to systematically assess these methods. To address this critical gap, we propose: (1) an interactive textual recommendation simulator incorporating rich user and item metadata and three typical evaluation scenarios (classic, evolving-interest, and cold-start recommendation tasks); (2) a unified modular framework for developing and studying agentic recommender systems; and (3) the first comprehensive benchmark comparing 10 classical and agentic recommendation methods. Our findings demonstrate the superiority of agentic systems and establish actionable design guidelines for their core components. The benchmark environment has been rigorously validated through an open challenge and remains publicly available with a continuously maintained leaderboard (https://tsinghua-fib-lab.github.io/AgentSocietyChallenge/pages/overview.html), fostering ongoing community engagement and reproducible research. The benchmark is available at: https://huggingface.co/datasets/SGJQovo/AgentRecBench.

摘要

基于大语言模型（LLM）的智能推荐系统标志着个性化推荐领域的范式转变，其通过LLM的高级推理与角色扮演能力实现自主、自适应的决策机制。与传统推荐方法不同，智能推荐系统能够动态收集并解析复杂环境中的用户-项目交互数据，生成具有跨场景泛化能力的鲁棒推荐策略。然而，该领域目前缺乏系统评估这些方法的标准化协议。为填补这一关键空白，我们提出：（1）集成丰富用户与项目元数据的交互式文本推荐模拟器，包含三种典型评估场景（经典推荐任务、兴趣演化任务和冷启动任务）；（2）用于开发和研究智能推荐系统的统一模块化框架；（3）首个全面对比10种经典方法与智能推荐方法的基准测试。研究结果验证了智能系统的优越性，并为其核心组件制定了可操作的设计准则。该基准环境已通过公开挑战赛严格验证，并保持公开可访问的持续维护排行榜，以促进学界持续参与和可重复研究。基准测试地址详见：https://huggingface.co/datasets/SGJQovo/AgentRecBench。

Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models

Abstract

arXiv:2505.19631v1 Announce Type: cross Abstract: Word segmentation stands as a cornerstone of Natural Language Processing (NLP). Based on the concept of "comprehend first, segment later", we propose a new framework to explore the limit of unsupervised word segmentation with Large Language Models (LLMs) and evaluate the semantic understanding capabilities of LLMs based on word segmentation. We employ current mainstream LLMs to perform word segmentation across multiple languages to assess LLMs' "comprehension". Our findings reveal that LLMs are capable of following simple prompts to segment raw text into words. There is a trend suggesting that models with more parameters tend to perform better on multiple languages. Additionally, we introduce a novel unsupervised method, termed LLACA (Large Language Model-Inspired Aho-Corasick Automaton). Leveraging the advanced pattern recognition capabilities of Aho-Corasick automata, LLACA innovatively combines these with the deep insights of well-pretrained LLMs. This approach not only enables the construction of a dynamic $n$-gram model that adjusts based on contextual information but also integrates the nuanced understanding of LLMs, offering significant improvements over traditional methods. Our source code is available at https://github.com/hkr04/LLACA

摘要

分词是自然语言处理（NLP）的基石任务。基于"先理解，后切分"的理念，我们提出一个新框架来探索大语言模型（LLMs）在无监督分词任务中的性能极限，并通过分词任务评估LLMs的语义理解能力。我们采用当前主流LLMs在多种语言上进行分词实验以评估其"理解"能力。研究发现，LLMs能够遵循简单指令将原始文本切分为词语，且存在参数量越大的模型在多语言任务中表现越优的趋势。此外，我们提出了一种创新的无监督方法LLACA（大语言模型启发的Aho-Corasick自动机），该方法巧妙结合了Aho-Corasick自动机的高效模式识别能力和预训练LLMs的深层语义理解优势。LLACA不仅能构建基于上下文动态调整的n-gram模型，还融合了LLMs的细粒度语义理解，相较传统方法实现了显著提升。项目源代码已开源：https://github.com/hkr04/LLACA

Large Language Models in Code Co-generation for Safe Autonomous Vehicles

Abstract

arXiv:2505.19658v1 Announce Type: cross Abstract: Software engineers in various industrial domains are already using Large Language Models (LLMs) to accelerate the process of implementing parts of software systems. When considering its potential use for ADAS or AD systems in the automotive context, there is a need to systematically assess this new setup: LLMs entail a well-documented set of risks for safety-related systems' development due to their stochastic nature. To reduce the effort for code reviewers to evaluate LLM-generated code, we propose an evaluation pipeline to conduct sanity-checks on the generated code. We compare the performance of six state-of-the-art LLMs (CodeLlama, CodeGemma, DeepSeek-r1, DeepSeek-Coders, Mistral, and GPT-4) on four safety-related programming tasks. Additionally, we qualitatively analyse the most frequent faults generated by these LLMs, creating a failure-mode catalogue to support human reviewers. Finally, the limitations and capabilities of LLMs in code generation, and the use of the proposed pipeline in the existing process, are discussed.

摘要

各工业领域的软件工程师已开始使用大语言模型（LLM）来加速软件系统部分模块的实现过程。在考虑将其应用于汽车领域的ADAS或AD系统时，需要系统评估这一新方案：由于LLM的随机性特性，其在安全相关系统开发中存在一系列明确记录的风险。为降低代码审查人员评估LLM生成代码的工作量，我们提出了一种用于执行生成代码完整性检查的评估流程。本研究对比了六种前沿LLM（CodeLlama、CodeGemma、DeepSeek-r1、DeepSeek-Coders、Mistral和GPT-4）在四项安全相关编程任务中的表现。此外，我们通过定性分析这些LLM生成的最常见错误，建立了故障模式分类目录以支持人工审查。最后，本文探讨了LLM在代码生成方面的局限性与能力，以及所提评估流程在现有开发过程中的应用价值。

Automated evaluation of children's speech fluency for low-resource languages

Abstract

arXiv:2505.19671v1 Announce Type: cross Abstract: Assessment of children's speaking fluency in education is well researched for majority languages, but remains highly challenging for low-resource languages. This paper proposes a system to automatically assess fluency by combining a fine-tuned multilingual ASR model, an objective metrics extraction stage, and a generative pre-trained transformer (GPT) network. The objective metrics include phonetic and word error rates, speech rate, and speech-pause duration ratio. These are interpreted by a GPT-based classifier, guided by a small set of human-evaluated ground truth examples, to score fluency. We evaluate the proposed system on a dataset of children's speech in two low-resource languages, Tamil and Malay, and compare the classification performance against Random Forest and XGBoost, as well as against using ChatGPT-4o to predict fluency directly from speech input. Results demonstrate that the proposed approach achieves significantly higher accuracy than multimodal GPT or other methods.

摘要

在教育领域，针对主流语言的儿童口语流畅度评估已有深入研究，但在资源匮乏语言中仍面临巨大挑战。本文提出一种自动评估系统，通过结合微调的多语言自动语音识别（ASR）模型、客观指标提取阶段以及生成式预训练变换器（GPT）网络来实现流畅度评估。客观指标包括音素错误率、单词错误率、语速及语音停顿时长比。这些指标由基于GPT的分类器进行解释，该分类器通过少量人工评估的真实样例进行引导，最终输出流畅度评分。我们在两种低资源语言（泰米尔语和马来语）的儿童语音数据集上评估了所提系统，并将分类性能与随机森林、XGBoost以及直接使用ChatGPT-4o从语音输入预测流畅度的方法进行对比。结果表明，所提方法比多模态GPT或其他方法实现了显著更高的准确率。
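
Illustrative note: one of the objective metrics listed in the abstract is word error rate. A minimal sketch of the usual definition (word-level edit distance divided by the reference length) is given below. It assumes whitespace-separated words and is not the paper's extraction stage; the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
print(word_error_rate("the cat sat", "the dog sat"))
```

The same routine yields a phonetic error rate if the inputs are space-separated phone sequences rather than words.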

MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE

Abstract

arXiv:2505.19645v1 Announce Type: cross Abstract: Large Language Models (LLMs) have achieved remarkable success across many applications, with Mixture of Experts (MoE) models demonstrating great potential. Compared to traditional dense models, MoEs achieve better performance with less computation. Speculative decoding (SD) is a widely used technique to accelerate LLM inference without accuracy loss, but it has been considered efficient only for dense models. In this work, we first demonstrate that, under medium batch sizes, MoE surprisingly benefits more from SD than dense models. Furthermore, as MoE becomes sparser -- the prevailing trend in MoE designs -- the batch size range where SD acceleration is expected to be effective becomes broader. To quantitatively understand tradeoffs involved in SD, we develop a reliable modeling based on theoretical analyses. While current SD research primarily focuses on improving acceptance rates of algorithms, changes in workload and model architecture can still lead to degraded SD acceleration even with high acceptance rates. To address this limitation, we introduce a new metric 'target efficiency' that characterizes these effects, thus helping researchers identify system bottlenecks and understand SD acceleration more comprehensively. For scenarios like private serving, this work unveils a new perspective to speed up MoE inference, where existing solutions struggle. Experiments on different GPUs show up to 2.29x speedup for Qwen2-57B-A14B at medium batch sizes and validate our theoretical predictions.

摘要

大型语言模型（LLMs）已在众多应用中取得显著成功，其中混合专家（MoE）模型展现出巨大潜力。与传统密集模型相比，MoE能以更少计算量实现更优性能。推测解码（SD）作为无需牺牲精度的LLM推理加速技术被广泛采用，但此前仅被认为对密集模型有效。本研究首先揭示：在中等批量大小下，MoE从SD中获得的加速效益意外优于密集模型；且随着MoE稀疏化程度提升（当前设计的主流趋势），SD加速有效的批量大小范围将进一步扩大。为量化分析SD的权衡机制，我们基于理论分析建立了可靠建模框架。现有SD研究主要聚焦算法接受率的提升，但工作负载与模型架构的变化仍可能导致高接受率下的SD加速效果下降。针对这一局限，我们提出"目标效率"新指标来表征这些影响，帮助研究者系统定位瓶颈并全面理解SD加速机制。对于私有服务等现有解决方案乏力的场景，本研究为加速MoE推理提供了新视角。在不同GPU上的实验表明，Qwen2-57B-A14B模型在中等批量大小下最高可获得2.29倍加速，验证了理论预测的正确性。
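
Illustrative note: the abstract argues that a high acceptance rate alone does not guarantee speculative-decoding speedup. The sketch below is not the paper's "target efficiency" metric; it is the commonly used estimate of tokens produced per verification step, under the simplifying assumption that each draft token is accepted independently with the same probability. Function and parameter names are hypothetical.

```python
def expected_tokens_per_step(acceptance_rate, draft_len):
    """Expected tokens emitted per target-model verification step.

    Assumes each of the draft_len draft tokens is accepted independently with
    probability acceptance_rate, and that the target model always contributes
    one additional token.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must lie in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len.
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


print(expected_tokens_per_step(0.8, 4))
```

This counts tokens, not time. The realized speedup also depends on how much slower a verification pass over several tokens is than a normal decoding step, which is where batch size and MoE sparsity enter.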

LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation

Abstract

arXiv:2505.19667v1 Announce Type: cross Abstract: Legal consultation is essential for safeguarding individual rights and ensuring access to justice, yet remains costly and inaccessible to many individuals due to the shortage of professionals. While recent advances in Large Language Models (LLMs) offer a promising path toward scalable, low-cost legal assistance, current systems fall short in handling the interactive and knowledge-intensive nature of real-world consultations. To address these challenges, we introduce LeCoDe, a real-world multi-turn benchmark dataset comprising 3,696 legal consultation dialogues with 110,008 dialogue turns, designed to evaluate and improve LLMs' legal consultation capability. With LeCoDe, we innovatively collect live-streamed consultations from short-video platforms, providing authentic multi-turn legal consultation dialogues. The rigorous annotation by legal experts further enhances the dataset with professional insights and expertise. Furthermore, we propose a comprehensive evaluation framework that assesses LLMs' consultation capabilities in terms of (1) clarification capability and (2) professional advice quality. This unified framework incorporates 12 metrics across two dimensions. Through extensive experiments on various general and domain-specific LLMs, our results reveal significant challenges in this task, with even state-of-the-art models like GPT-4 achieving only 39.8% recall for clarification and 59% overall score for advice quality, highlighting the complexity of professional consultation scenarios. Based on these findings, we further explore several strategies to enhance LLMs' legal consultation abilities. Our benchmark contributes to advancing research in legal domain dialogue systems, particularly in simulating more real-world user-expert interactions.

摘要

法律咨询对于保障个人权利和实现司法公正至关重要，但由于专业人员短缺，其高昂成本使许多人难以获得服务。尽管大语言模型（LLMs）的最新进展为可扩展、低成本的司法援助提供了可行路径，现有系统仍难以应对现实咨询中交互性强且知识密集的特性。为解决这些问题，我们推出LeCoDe——一个包含3,696个法律咨询对话（共计110,008轮对话）的真实多轮对话基准数据集，旨在评估和提升LLMs的法律咨询能力。该数据集创新性地采集自短视频平台的直播咨询记录，提供真实的多轮法律对话样本。通过法律专家严格的标注流程，数据集进一步融入了专业见解与行业知识。此外，我们提出包含（1）澄清能力与（2）专业建议质量的双维评估框架，整合12项指标形成统一评价体系。在对各类通用及领域专用LLMs的实验中，结果表明该任务存在显著挑战：即便是GPT-4等前沿模型，其澄清召回率仅达39.8%，建议质量综合得分仅为59%，凸显专业咨询场景的复杂性。基于这些发现，我们进一步探索了提升LLMs法律咨询能力的多种策略。本基准数据集将推动法律领域对话系统的研究发展，特别是在模拟更贴近现实世界的用户-专家交互方面。

Abstract

arXiv:2505.19675v1 Announce Type: cross Abstract: The traditional process of creating labeled datasets is labor-intensive and expensive. Recent breakthroughs in open-source large language models (LLMs) have opened up a new avenue in generating labeled datasets automatically for various natural language processing (NLP) tasks, providing an alternative to such an expensive annotation process. However, the reliability of such auto-generated labels remains a significant concern due to inherent inaccuracies. When learning from noisy labels, the model's generalization is likely to be harmed as it is prone to overfit to those label noises. While previous studies in learning from noisy labels mainly focus on synthetic noise and real-world noise, LLM-generated label noise receives less attention. In this paper, we propose SiDyP: Simplex Label Diffusion with Dynamic Prior to calibrate the classifier's prediction, thus enhancing its robustness towards LLM-generated noisy labels. SiDyP retrieves potential true label candidates by neighborhood label distribution in text embedding space and iteratively refines noisy candidates using a simplex diffusion model. Our framework can increase the performance of the BERT classifier fine-tuned on both zero-shot and few-shot LLM-generated noisy label datasets by an average of 7.21% and 7.30% respectively. We demonstrate the effectiveness of SiDyP by conducting extensive benchmarking for different LLMs over a variety of NLP tasks. Our code is available on Github.

摘要

传统标注数据集的构建过程劳动密集且成本高昂。开源大语言模型（LLM）的最新突破为自然语言处理（NLP）任务提供了一种自动生成标注数据集的新途径，替代了这一昂贵的人工标注流程。然而由于固有的不准确性，此类自动生成标签的可靠性仍存在重大隐患。当模型从带噪声的标签中学习时，其泛化能力很可能因过度拟合标签噪声而受损。现有关于噪声标签学习的研究主要集中于合成噪声和真实场景噪声，而对LLM生成标签噪声的关注较少。本文提出SiDyP框架：基于动态先验的单形标签扩散方法，通过校准分类器预测来增强其对LLM生成噪声标签的鲁棒性。SiDyP通过在文本嵌入空间中检索邻域标签分布获取潜在真实标签候选，并利用单形扩散模型迭代优化噪声候选。实验表明，我们的框架能使基于零样本和少样本LLM生成噪声标签数据集微调的BERT分类器性能平均提升7.21%和7.30%。我们通过对不同LLM在多种NLP任务上进行广泛基准测试，验证了SiDyP的有效性。相关代码已在Github开源。

GenKI: Enhancing Open-Domain Question Answering with Knowledge Integration and Controllable Generation in Large Language Models

Abstract

arXiv:2505.19660v1 Announce Type: cross Abstract: Open-domain question answering (OpenQA) represents a cornerstone in natural language processing (NLP), primarily focused on extracting answers from unstructured textual data. With the rapid advancements in Large Language Models (LLMs), LLM-based OpenQA methods have reaped the benefits of emergent understanding and answering capabilities enabled by massive parameters compared to traditional methods. However, most of these methods encounter two critical challenges: how to integrate knowledge into LLMs effectively and how to adaptively generate results with specific answer formats for various task situations. To address these challenges, we propose a novel framework named GenKI, which aims to improve the OpenQA performance by exploring Knowledge Integration and controllable Generation on LLMs simultaneously. Specifically, we first train a dense passage retrieval model to retrieve associated knowledge from a given knowledge base. Subsequently, we introduce a novel knowledge integration model that incorporates the retrieved knowledge into instructions during fine-tuning to strengthen the model. Furthermore, to enable controllable generation in LLMs, we leverage a fine-tuned LLM and an ensemble based on text consistency that jointly accounts for coherence, fluency, and answer format assurance. Finally, extensive experiments conducted on the TriviaQA, MSMARCO, and CMRC2018 datasets, featuring diverse answer formats, have demonstrated the effectiveness of GenKI in comparison with state-of-the-art baselines. Moreover, ablation studies have disclosed a linear relationship between the frequency of retrieved knowledge and the model's ability to recall knowledge accurately against the ground truth. Our code of GenKI is available at https://github.com/USTC-StarTeam/GenKI

摘要

开放域问答（OpenQA）是自然语言处理（NLP）领域的核心任务，主要致力于从非结构化文本数据中提取答案。随着大语言模型（LLMs）的快速发展，与传统方法相比，基于LLM的OpenQA方法凭借海量参数带来的涌现性理解与回答能力获得了显著优势。然而，这些方法大多面临两大关键挑战：如何有效将知识整合到LLMs中，以及如何针对不同任务场景自适应生成具有特定答案格式的结果。为解决这些问题，我们提出名为GenKI的创新框架，通过同步探索LLMs的知识整合与可控生成来提升OpenQA性能。具体而言，我们首先训练稠密段落检索模型从给定知识库中检索关联知识；随后提出新型知识整合模型，在微调阶段将检索知识融入指令以增强模型；此外，为实现LLMs的可控生成，我们采用特定微调LLM及基于文本一致性的集成方法，确保生成内容在连贯性、流畅性和答案格式三方面的统一性。最终，在TriviaQA、MSMARCO和CMRC2018等包含多样化答案格式的数据集上的大量实验表明，相较于最先进基线方法，GenKI具有显著优势。消融研究进一步揭示了检索知识出现频率与模型依据标准答案准确召回知识能力之间的线性关系。GenKI代码已开源：https://github.com/USTC-StarTeam/GenKI

Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models

Abstract

arXiv:2505.19700v1 Announce Type: cross Abstract: The widespread adoption of large language models (LLMs) across industries has increased the demand for high-quality and customizable outputs. However, traditional alignment methods often require retraining large pretrained models, making it difficult to quickly adapt and optimize LLMs for diverse applications. To address this limitation, we propose a novel Residual Alignment Model (RAM) that formalizes the alignment process as a type of importance sampling. In this framework, the unaligned upstream model serves as the proposal distribution, while the alignment process is framed as secondary sampling based on an autoregressive alignment module that acts as an estimator of the importance weights. This design enables a natural detachment of the alignment module from the target aligned model, improving flexibility and scalability. Based on this model, we derive an efficient sequence-level training strategy for the alignment module, which operates independently of the proposal module. Additionally, we develop a resampling algorithm with iterative token-level decoding to address the common first-token latency issue in comparable methods. Experimental evaluations on two leading open-source LLMs across diverse tasks, including instruction following, domain adaptation, and preference optimization, demonstrate that our approach consistently outperforms baseline models.

摘要

大型语言模型（LLMs）在各行业的广泛应用提升了对高质量、可定制化输出的需求。然而，传统对齐方法通常需要重新训练大型预训练模型，难以快速适应和优化LLMs的多样化应用场景。为克服这一局限，我们提出一种新型的残差对齐模型（RAM），将对齐过程形式化为一种重要性采样。在该框架中，未对齐的上游模型作为提议分布，而对齐过程则基于自回归对齐模块（作为重要性权重的估计器）进行二次采样。这种设计使得对齐模块能够自然地与目标对齐模型分离，从而提高灵活性和可扩展性。基于该模型，我们推导出针对对齐模块的高效序列级训练策略，该策略独立于提议模块运行。此外，为解决同类方法中普遍存在的首词延迟问题，我们开发了具有迭代词级解码功能的重采样算法。在两个领先开源LLMs上的多任务实验评估（包括指令跟随、领域适应和偏好优化）表明，本方法始终优于基线模型。
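
Illustrative note: the abstract frames alignment as importance sampling, with the upstream model as proposal and a separate module estimating importance weights. The sketch below shows only generic self-normalized importance resampling over a fixed candidate set, not RAM's training strategy or its token-level decoding. The candidates and log-weights are hypothetical placeholders for proposal samples and alignment-module scores.

```python
import math
import random


def importance_resample(candidates, log_weights, k, rng=None):
    """Resample k items from proposal draws using self-normalized weights.

    candidates:  samples drawn from the proposal distribution.
    log_weights: one log importance weight per candidate.
    Returns the resampled items and the normalized probabilities used.
    """
    if len(candidates) != len(log_weights) or not candidates:
        raise ValueError("need one log-weight per candidate")
    rng = rng or random.Random()
    # Subtract the maximum before exponentiating to avoid overflow.
    top = max(log_weights)
    weights = [math.exp(lw - top) for lw in log_weights]
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(candidates, weights=probs, k=k), probs


# Hypothetical candidates and scores, for illustration only.
picks, probs = importance_resample(
    ["reply A", "reply B", "reply C"], [0.0, 1.0, -2.0], k=5, rng=random.Random(0)
)
print(probs)
```

Because the weights only rescale draws from the proposal, the proposal model itself is never modified, which is the sense in which a weighting module can be detached from it.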

Graceful Forgetting in Generative Language Models

Abstract

arXiv:2505.19715v1 Announce Type: cross Abstract: Recently, the pretrain-finetune paradigm has become a cornerstone in various deep learning areas. While pre-trained models generally improve both the effectiveness and efficiency of downstream fine-tuning, studies have shown that not all knowledge acquired during pre-training is beneficial. Some of the knowledge may actually bring detrimental effects to the fine-tuning tasks, which is also known as negative transfer. To address this problem, graceful forgetting has emerged as a promising approach. The core principle of graceful forgetting is to enhance the learning plasticity of the target task by selectively discarding irrelevant knowledge. However, this approach remains underexplored in the context of generative language models, and it is often challenging to migrate existing forgetting algorithms to these models due to architecture incompatibility. To bridge this gap, in this paper we propose a novel framework, Learning With Forgetting (LWF), to achieve graceful forgetting in generative language models. With Fisher Information Matrix weighting the intended parameter updates, LWF computes forgetting confidence to evaluate self-generated knowledge regarding the forgetting task, and consequently, knowledge with high confidence is periodically unlearned during fine-tuning. Our experiments demonstrate that, although thoroughly uncovering the mechanisms of knowledge interaction remains challenging in pre-trained language models, applying graceful forgetting can contribute to enhanced fine-tuning performance.

摘要

近年来，预训练-微调范式已成为各深度学习领域的基石。尽管预训练模型通常能提升下游任务微调的效果和效率，但研究表明并非所有预训练获得的知识都具有益处。部分知识实际上可能对微调任务产生有害影响，这种现象被称为负迁移。为解决该问题，优雅遗忘作为一种有前景的方法应运而生。其核心原理是通过选择性剔除无关知识来增强目标任务的学习可塑性。然而，该方法在生成式语言模型领域仍探索不足，且由于架构不兼容性，现有遗忘算法往往难以迁移至此类模型。为填补这一空白，本文提出新型框架"学习与遗忘"(LWF)，以实现生成式语言模型的优雅遗忘。LWF利用费雪信息矩阵对目标参数更新进行加权，计算遗忘置信度以评估模型自生成知识相对于遗忘任务的价值，进而周期性地剔除高置信度知识。实验表明，尽管在预训练语言模型中彻底揭示知识交互机制仍具挑战性，但应用优雅遗忘能有效提升微调性能。

Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision

Abstract

arXiv:2505.19706v1 Announce Type: cross Abstract: Large Language Models (LLMs) are prone to hallucination, especially during multi-hop and reasoning-intensive tasks such as mathematical problem solving. While Outcome Reward Models verify only final answers, Process Reward Models (PRMs) score each intermediate step to steer generation toward coherent solutions. We introduce PathFinder-PRM, a novel hierarchical, error-aware discriminative PRM that first classifies math and consistency errors at each step, then combines these fine-grained signals to estimate step correctness. To train PathFinder-PRM, we construct a 400K-sample dataset by enriching the human-annotated PRM800K corpus and RLHFlow Mistral traces with three-dimensional step-level labels. On PRMBench, PathFinder-PRM achieves a new state-of-the-art PRMScore of 67.7, outperforming the prior best (65.5) while using 3 times less data. When applied to reward guided greedy search, our model yields prm@8 48.3, a +1.5 point gain over the strongest baseline. These results demonstrate that decoupled error detection and reward estimation not only boost fine-grained error detection but also substantially improve end-to-end, reward-guided mathematical reasoning with greater data efficiency.

摘要

大型语言模型（LLMs）容易产生幻觉，尤其在数学问题求解等多跳且需要密集推理的任务中。虽然结果奖励模型仅验证最终答案，但过程奖励模型（PRMs）会对每个中间步骤进行评分，以引导生成连贯的解决方案。我们提出PathFinder-PRM，这是一种新颖的、分层次的、具有错误感知能力的判别式PRM，它首先对每个步骤中的数学错误和一致性错误进行分类，然后结合这些细粒度信号来评估步骤的正确性。为了训练PathFinder-PRM，我们构建了一个包含40万样本的数据集，该数据集通过为人工标注的PRM800K语料库和RLHFlow Mistral轨迹添加三维步骤级标签而得到增强。在PRMBench上，PathFinder-PRM以67.7的PRMScore创造了新的最优记录，优于之前的最佳结果（65.5），同时使用的数据量减少了3倍。当应用于奖励引导的贪婪搜索时，我们的模型实现了prm@8 48.3，比最强基线提高了1.5个百分点。这些结果表明，解耦的错误检测和奖励估计不仅提升了细粒度错误检测能力，还显著改善了端到端、奖励引导的数学推理，并具有更高的数据效率。

NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering

Abstract

arXiv:2505.19754v1 Announce Type: cross Abstract: The increasing number of academic papers poses significant challenges for researchers to efficiently acquire key details. While retrieval augmented generation (RAG) shows great promise in large language model (LLM) based automated question answering, previous works often isolate neural and symbolic retrieval despite their complementary strengths. Moreover, conventional single-view chunking neglects the rich structure and layout of PDFs, e.g., sections and tables. In this work, we propose NeuSym-RAG, a hybrid neural symbolic retrieval framework which combines both paradigms in an interactive process. By leveraging multi-view chunking and schema-based parsing, NeuSym-RAG organizes semi-structured PDF content into both the relational database and vectorstore, enabling LLM agents to iteratively gather context until sufficient to generate answers. Experiments on three full PDF-based QA datasets, including a self-annotated one AIRQA-REAL, show that NeuSym-RAG stably defeats both the vector-based RAG and various structured baselines, highlighting its capacity to unify both retrieval schemes and utilize multiple views. Code and data are publicly available at https://github.com/X-LANCE/NeuSym-RAG.

摘要

日益增长的学术论文数量对研究者高效获取关键信息提出了重大挑战。尽管检索增强生成（RAG）技术在基于大语言模型（LLM）的自动问答中展现出巨大潜力，但现有研究往往将神经检索与符号检索割裂对待，未能结合二者的互补优势。此外，传统的单视角文本分块方法忽略了PDF文档丰富的结构与版式特征（如章节和表格）。本研究提出NeuSym-RAG框架，通过交互式流程整合神经与符号检索范式。该框架采用多视角分块和基于模式的解析技术，将半结构化PDF内容同时组织到关系型数据库和向量库中，使LLM智能体能够迭代收集上下文直至生成充足答案。在三个基于完整PDF的问答数据集（包括自行标注的AIRQA-REAL数据集）上的实验表明，NeuSym-RAG稳定优于基于向量的RAG及多种结构化基线方法，凸显其统一两种检索范式并利用多视角信息的能力。代码与数据已公开于https://github.com/X-LANCE/NeuSym-RAG。

Distilling Closed-Source LLM's Knowledge for Locally Stable and Economic Biomedical Entity Linking

Abstract

arXiv:2505.19722v1 Announce Type: cross Abstract: Biomedical entity linking aims to map nonstandard entities to standard entities in a knowledge base. Traditional supervised methods perform well but require extensive annotated data to transfer, limiting their usage in low-resource scenarios. Large language models (LLMs), especially closed-source LLMs, can address these limitations but risk stability issues and high economic costs: access to these models is controlled by commercial companies, and processing large amounts of data with them is expensive. To address this, we propose "RPDR", a framework combining closed-source LLMs and open-source LLMs for re-ranking candidates retrieved by a retriever fine-tuned with a small amount of data. By prompting a closed-source LLM to generate training data from unannotated data and fine-tuning an open-source LLM for re-ranking, we effectively distill the knowledge into an open-source LLM that can be deployed locally, thus avoiding the stability issues and the problem of high economic costs. We evaluate RPDR on two datasets, one real-world dataset and one publicly available dataset, covering two languages: Chinese and English. When training data is scarce, RPDR achieves Acc@1 improvements of 0.019 and 0.036 on the Aier dataset and the Ask A Patient dataset, respectively. The results demonstrate the superiority and generalizability of the proposed framework.

摘要

生物医学实体链接旨在将非标准实体映射至知识库中的标准实体。传统监督方法虽表现良好，但需要大量标注数据进行迁移，限制了其在低资源场景下的应用。大语言模型（LLMs），尤其是闭源LLMs，虽能解决这些问题，但存在稳定性风险和高经济成本：这些模型的使用受商业公司限制，且处理大量数据时会带来显著经济负担。为此，我们提出"RPDR"框架，通过结合闭源与开源LLMs，对经少量数据微调的检索器所获候选结果进行重排序。该方法通过提示闭源LLM从未标注数据生成训练数据，并微调开源LLM进行重排序，从而有效将知识蒸馏至可本地部署的开源LLM，规避了稳定性问题与高成本问题。我们在两个数据集（包含一个真实世界数据集和一个涉及中英双语的公开数据集）上评估RPDR。当训练数据不足时，RPDR在Aier数据集和Ask A Patient数据集上分别实现0.019和0.036的Acc@1提升。实验结果验证了该框架的优越性与泛化能力。

FoodTaxo: Generating Food Taxonomies with Large Language Models

Abstract

arXiv:2505.19838v1 Announce Type: cross Abstract: We investigate the utility of Large Language Models for automated taxonomy generation and completion specifically applied to taxonomies from the food technology industry. We explore the extent to which taxonomies can be completed from a seed taxonomy or generated without a seed from a set of known concepts, in an iterative fashion using recent prompting techniques. Experiments on five taxonomies using an open-source LLM (Llama-3), while promising, point to the difficulty of correctly placing inner nodes.

摘要

我们研究了大型语言模型在食品技术行业分类法生成与补全任务中的应用效能。通过采用最新的提示技术，我们以迭代方式探索了以下两种场景：从种子分类法进行补全，或仅基于已知概念集无种子生成分类法。使用开源LLM（Llama-3）在五个分类体系上的实验结果表明，虽然前景可观，但正确放置内部节点仍存在显著困难。

MT$^{3}$: Scaling MLLM-based Text Image Machine Translation via Multi-Task Reinforcement Learning

Abstract

arXiv:2505.19714v1 Announce Type: cross Abstract: Text Image Machine Translation (TIMT), the task of translating textual content embedded in images, is critical for applications in accessibility, cross-lingual information access, and real-world document understanding. However, TIMT remains a complex challenge due to the need for accurate optical character recognition (OCR), robust visual-text reasoning, and high-quality translation, often requiring cascading multi-stage pipelines. Recent advances in large-scale Reinforcement Learning (RL) have improved reasoning in Large Language Models (LLMs) and Multimodal LLMs (MLLMs), but their application to end-to-end TIMT is still underexplored. To bridge this gap, we introduce MT$^{3}$, the first framework to apply Multi-Task RL to MLLMs for end-to-end TIMT. MT$^{3}$ adopts a multi-task optimization paradigm targeting three key sub-skills: text recognition, context-aware reasoning, and translation. It is trained using a novel multi-mixed reward mechanism that adapts rule-based RL strategies to TIMT's intricacies, offering fine-grained, non-binary feedback across tasks. Furthermore, to facilitate the evaluation of TIMT in authentic cross-cultural and real-world social media contexts, we introduce XHSPost, the first social media TIMT benchmark. Our MT$^{3}$-7B-Zero achieves state-of-the-art results on the latest in-domain MIT-10M benchmark, outperforming strong baselines such as Qwen2.5-VL-72B and InternVL2.5-78B by notable margins across multiple metrics. Additionally, the model shows strong generalization to out-of-distribution language pairs and datasets. In-depth analyses reveal how multi-task synergy, reinforcement learning initialization, curriculum design, and reward formulation contribute to advancing MLLM-driven TIMT.

摘要

文本图像机器翻译（TIMT）——即翻译嵌入图像中的文本内容的任务——对于无障碍访问、跨语言信息获取和现实场景文档理解等应用至关重要。然而，由于需要精确的光学字符识别（OCR）、鲁棒的视觉-文本推理以及高质量的翻译，TIMT仍然是一项复杂的挑战，通常需要级联的多阶段处理流程。尽管大规模强化学习（RL）的最新进展提升了大型语言模型（LLMs）和多模态LLMs（MLLMs）的推理能力，但其在端到端TIMT中的应用仍未充分探索。为填补这一空白，我们提出了MT $^{3}$ ——首个将多任务RL应用于MLLMs以实现端到端TIMT的框架。MT $^{3}$ 采用针对三个关键子技能（文本识别、上下文感知推理和翻译）的多任务优化范式，通过新型多混合奖励机制进行训练，该机制将基于规则的RL策略适配于TIMT的复杂性，为各任务提供细粒度的非二元反馈。此外，为促进TIMT在真实跨文化和现实社交媒体场景中的评估，我们构建了首个社交媒体TIMT基准XHSPost。我们的MT $^{3}$ -7B-Zero模型在最新领域内MIT-10M基准测试中取得了最先进成果，在多项指标上显著优于Qwen2.5-VL-72B和InternVL2.5-78B等强基线模型，同时展现出对分布外语言对和数据集的强大泛化能力。深度分析揭示了多任务协同、强化学习初始化、课程设计及奖励机制如何共同推动MLLM驱动的TIMT发展。

Agentic Predictor: Performance Prediction for Agentic Workflows via Multi-View Encoding

Abstract

arXiv:2505.19764v1 Announce Type: cross Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but optimizing LLM-based agentic systems remains challenging due to the vast search space of agent configurations, prompting strategies, and communication patterns. Existing approaches often rely on heuristic-based tuning or exhaustive evaluation, which can be computationally expensive and suboptimal. This paper proposes Agentic Predictor, a lightweight predictor for efficient agentic workflow evaluation. Agentic Predictor is equipped with a multi-view workflow encoding technique that leverages multi-view representation learning of agentic systems by incorporating code architecture, textual prompts, and interaction graph features. To achieve high predictive accuracy while significantly reducing the number of required workflow evaluations for training a predictor, Agentic Predictor employs cross-domain unsupervised pretraining. By learning to approximate task success rates, Agentic Predictor enables fast and accurate selection of optimal agentic workflow configurations for a given task, significantly reducing the need for expensive trial-and-error evaluations. Experiments on a carefully curated benchmark spanning three domains show that our predictor outperforms state-of-the-art methods in both predictive accuracy and workflow utility, highlighting the potential of performance predictors in streamlining the design of LLM-based agentic workflows.

摘要

大型语言模型（LLMs）已在多样化任务中展现出卓越能力，但基于LLM的智能体系统优化仍面临挑战，这源于智能体配置、提示策略和通信模式所构成的庞大搜索空间。现有方法通常依赖启发式调优或穷举评估，存在计算成本高且次优的问题。本文提出Agentic Predictor——一种用于高效评估智能体工作流程的轻量级预测器。该预测器采用多视图工作流编码技术，通过融合代码架构、文本提示和交互图特征，实现智能体系统的多视角表征学习。为在显著减少训练预测器所需工作流评估次数的同时保持高预测精度，Agentic Predictor采用跨领域无监督预训练方法。通过学习近似任务成功率，该预测器能快速准确地为给定任务选择最优智能体工作流配置，大幅减少昂贵试错评估的需求。在涵盖三个领域的精心构建基准测试中，实验表明我们的预测器在预测精度和工作流效用方面均优于最先进方法，凸显了性能预测器在优化基于LLM的智能体工作流设计方面的潜力。

Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective

Abstract

arXiv:2505.19815v1 Announce Type: cross Abstract: We propose a novel framework for comprehending the reasoning capabilities of large language models (LLMs) through the perspective of meta-learning. By conceptualizing reasoning trajectories as pseudo-gradient descent updates to the LLM's parameters, we identify parallels between LLM reasoning and various meta-learning paradigms. We formalize the training process for reasoning tasks as a meta-learning setup, with each question treated as an individual task, and reasoning trajectories serving as the inner loop optimization for adapting model parameters. Once trained on a diverse set of questions, the LLM develops fundamental reasoning capabilities that can generalize to previously unseen questions. Extensive empirical evaluations substantiate the strong connection between LLM reasoning and meta-learning, exploring several issues of significant interest from a meta-learning standpoint. Our work not only enhances the understanding of LLM reasoning but also provides practical insights for improving these models through established meta-learning techniques.

摘要

我们提出了一种通过元学习视角理解大语言模型（LLM）推理能力的新框架。通过将推理轨迹概念化为对LLM参数的伪梯度下降更新，我们揭示了LLM推理与各类元学习范式之间的内在关联。我们将推理任务的训练过程形式化为元学习框架：每个问题被视为独立任务，推理轨迹则作为调整模型参数的内循环优化过程。当模型在多样化问题集上完成训练后，即可发展出能泛化至未见问题的基本推理能力。大量实证研究证实了LLM推理与元学习之间的紧密联系，并从元学习角度探讨了若干具有重要价值的问题。本研究不仅深化了对LLM推理机制的理解，还为通过成熟元学习技术改进这些模型提供了实践启示。

Analyzing Political Bias in LLMs via Target-Oriented Sentiment Classification

Abstract

arXiv:2505.19776v1 Announce Type: cross Abstract: Political biases encoded by LLMs might have detrimental effects on downstream applications. Existing bias analysis methods rely on small-size intermediate tasks (questionnaire answering or political content generation) and rely on the LLMs themselves for analysis, thus propagating bias. We propose a new approach leveraging the observation that LLM sentiment predictions vary with the target entity in the same sentence. We define an entropy-based inconsistency metric to encode this prediction variability. We insert 1319 demographically and politically diverse politician names in 450 political sentences and predict target-oriented sentiment using seven models in six widely spoken languages. We observe inconsistencies in all tested combinations and aggregate them in a statistically robust analysis at different granularity levels. We observe positive and negative bias toward left and far-right politicians and positive correlations between politicians with similar alignment. Bias intensity is higher for Western languages than for others. Larger models exhibit stronger and more consistent biases and reduce discrepancies between similar languages. We partially mitigate LLM unreliability in target-oriented sentiment classification (TSC) by replacing politician names with fictional but plausible counterparts.

摘要

大型语言模型（LLM）中编码的政治偏见可能对下游应用产生负面影响。现有偏见分析方法依赖于小规模中间任务（问卷回答或政治内容生成），并依赖LLM自身进行分析，从而导致偏见传播。我们提出一种新方法，利用LLM对同一句子中不同目标实体的情感预测存在差异这一现象，定义基于熵的不一致性度量来量化这种预测变异。我们在450个政治相关句子中插入1319个具有人口统计和政治多样性特征的政客姓名，使用六种广泛使用语言的七个模型进行目标导向情感预测。所有测试组合均观察到不一致性，并通过统计稳健方法在不同粒度层面进行聚合分析。研究发现对左翼和极右政客存在正向和负向偏见，且政治立场相近的政客间呈现正相关性。西方语言中的偏见强度高于其他语言。更大规模的模型表现出更强且更一致的偏见，同时缩小了相似语言间的差异。通过将政客姓名替换为虚构但合理的替代名称，我们部分缓解了LLM在目标导向情感分类（TSC）中的不可靠性问题。
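
The entropy-based inconsistency idea described above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the normalized Shannon entropy below, the function name `label_entropy`, and the three-class label set are all assumptions made for illustration, not the authors' definition. The input is the list of sentiment labels a model predicts for one sentence as the inserted politician name varies.

```python
import math
from collections import Counter


def label_entropy(labels, num_classes=3):
    """Normalized Shannon entropy of the predicted labels for one sentence.

    0.0 means the prediction never changes with the inserted name;
    1.0 means the predictions are spread uniformly over all classes.
    (Hypothetical metric; the paper's own definition may differ.)
    """
    counts = Counter(labels)
    n = len(labels)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(num_classes)


# Same sentence, four different names inserted, one prediction each.
consistent = ["positive", "positive", "positive", "positive"]
mixed = ["positive", "negative", "neutral", "negative"]

print(label_entropy(consistent))
print(label_entropy(mixed))
```

A score of zero says the model's sentiment prediction ignores the target entity; higher scores indicate that the prediction depends on who is named, which is the variability the abstract proposes to measure.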

Beyond Specialization: Benchmarking LLMs for Transliteration of Indian Languages

Abstract

arXiv:2505.19851v1 Announce Type: cross Abstract: Transliteration, the process of mapping text from one script to another, plays a crucial role in multilingual natural language processing, especially within linguistically diverse contexts such as India. Despite significant advancements through specialized models like IndicXlit, recent developments in large language models suggest a potential for general-purpose models to excel at this task without explicit task-specific training. The current work systematically evaluates the performance of prominent LLMs, including GPT-4o, GPT-4.5, GPT-4.1, Gemma-3-27B-it, and Mistral-Large, against IndicXlit, a state-of-the-art transliteration model, across ten major Indian languages. Experiments utilized standard benchmarks, including the Dakshina and Aksharantar datasets, with performance assessed via Top-1 Accuracy and Character Error Rate. Our findings reveal that GPT family models generally outperform other LLMs and IndicXlit in most instances. Additionally, fine-tuning GPT-4o notably improves performance on specific languages. An extensive error analysis and robustness testing under noisy conditions further elucidate strengths of LLMs compared to specialized models, highlighting the efficacy of foundational models for a wide spectrum of specialized applications with minimal overhead.

摘要

音译作为将文本从一种文字系统映射到另一种文字系统的过程，在多语言自然语言处理中具有关键作用，尤其在印度等语言多样性突出的地区。尽管IndicXlit等专用模型已取得显著进展，但大语言模型的最新发展表明，通用模型无需经过特定任务训练即可在此类任务中表现卓越。本研究系统评估了GPT-4o、GPT-4.5、GPT-4.1、Gemma-3-27B-it和Mistral-Large等主流大语言模型与最先进的音译模型IndicXlit在十种主要印度语言上的性能对比。实验采用Dakshina和Aksharantar等标准基准数据集，通过Top-1准确率和字符错误率进行评估。研究发现：虽然GPT系列模型在多数情况下优于其他大语言模型及IndicXlit；此外，对GPT-4o进行微调可显著提升其在特定语言上的表现。通过详尽的错误分析及噪声环境下的鲁棒性测试，进一步揭示了大语言模型相较于专用模型的优势，证明了基础模型在各类专业应用中只需极低开销即可实现高效能。
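
The two evaluation metrics named above, Top-1 Accuracy and Character Error Rate (CER), can be sketched as follows. This is a generic illustration under the usual definitions (exact match rate, and total Levenshtein edits divided by total reference characters), not the evaluation script used in the paper. One assumption to note: Python strings are compared per code point, whereas an evaluation on Indic scripts might count grapheme clusters instead.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(refs, hyps) -> float:
    """Total edit distance divided by total reference length."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


def top1_accuracy(refs, hyps) -> float:
    """Fraction of items whose top prediction matches the reference exactly."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)


# Hypothetical romanized references and model outputs.
refs = ["namaste", "bharat"]
hyps = ["namaste", "bharath"]
print(top1_accuracy(refs, hyps))
print(character_error_rate(refs, hyps))
```

The two metrics are complementary: Top-1 Accuracy penalizes a one-character slip as heavily as a completely wrong output, while CER gives partial credit for near misses.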

APE: A Data-Centric Benchmark for Efficient LLM Adaptation in Text Summarization

Abstract

arXiv:2505.19912v1 Announce Type: cross Abstract: We present Adjacent Possible Exploration (APE), a simple yet effective method for adapting large language models to specific tasks using minimal computational resources. Unlike traditional fine-tuning that requires extensive compute, APE iteratively fine-tunes models on small, carefully selected data batches (200 examples), retaining only improvements. On news summarization, APE achieves 40 percent BLEU improvement using just a T4 GPU in 60 minutes, matching or exceeding more complex methods like LoRA while remaining conceptually simple. Our approach is particularly valuable for researchers and practitioners with limited computational resources. We provide open-source code and demonstrate APE's effectiveness through both automatic metrics and human evaluation. While inspired by evolutionary theory's "adjacent possible", APE's core insight has a very practical application: small, iterative data perturbations can efficiently guide LLMs toward task-specific performance without expensive retraining.

摘要

我们提出"邻近可能探索"（Adjacent Possible Exploration, APE），这是一种利用最小计算资源使大语言模型适配特定任务的简单而有效的方法。与传统需要大量计算资源的微调不同，APE通过在小规模精选数据批次（200个示例）上迭代微调模型，仅保留性能提升。在新闻摘要任务中，APE仅使用T4 GPU在60分钟内就实现了40%的BLEU分数提升，其表现匹配或超越LoRA等更复杂的方法，同时保持概念上的简洁性。该方法对计算资源有限的研究者和实践者具有特殊价值。我们开源了相关代码，并通过自动指标和人工评估证明了APE的有效性。虽然受进化论"邻近可能"概念的启发，但APE的核心洞见具有非常实际的应用价值：通过小而迭代的数据扰动，可以高效引导大语言模型实现特定任务性能，而无需昂贵的重新训练。
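
The accept-only-improvements loop described above can be sketched in a few lines. Everything here is a hypothetical stand-in: `fine_tune` and `evaluate` are caller-supplied callables, the batch is drawn at random rather than "carefully selected", and the toy "model" is a single float, so this shows only the control flow, not the authors' implementation.

```python
import random


def accept_if_better(model, data, fine_tune, evaluate,
                     batch_size=200, rounds=10, seed=0):
    """Repeatedly fine-tune on a small batch and keep the result only if
    the evaluation score improves (a simple hill-climbing loop)."""
    rng = random.Random(seed)
    best_score = evaluate(model)
    for _ in range(rounds):
        batch = rng.sample(list(data), k=min(batch_size, len(data)))
        candidate = fine_tune(model, batch)
        score = evaluate(candidate)
        if score > best_score:  # retain only improvements
            model, best_score = candidate, score
    return model, best_score


# Toy stand-ins: the "model" is one number that should approach 3.0.
data = [2.0, 2.5, 3.0, 3.5, 4.0, 10.0]


def toy_fine_tune(m, batch):
    return m + 0.5 * (sum(batch) / len(batch) - m)


def toy_evaluate(m):
    return -abs(m - 3.0)


final_model, final_score = accept_if_better(
    0.0, data, toy_fine_tune, toy_evaluate, batch_size=3, rounds=20)
print(final_model, final_score)
```

Because a candidate is discarded whenever its score does not improve, the returned score can never be worse than the starting score; the cost is one evaluation pass per round.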

Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles

Abstract

arXiv:2505.19914v1 Announce Type: cross Abstract: Large Language Models (LLMs), such as OpenAI's o1 and DeepSeek's R1, excel at advanced reasoning tasks like math and coding via Reinforcement Learning with Verifiable Rewards (RLVR), but still struggle with puzzles solvable by humans without domain knowledge. We introduce Enigmata, the first comprehensive suite tailored for improving LLMs with puzzle reasoning skills. It includes 36 tasks across seven categories, each with 1) a generator that produces unlimited examples with controllable difficulty and 2) a rule-based verifier for automatic evaluation. This generator-verifier design supports scalable, multi-task RL training, fine-grained analysis, and seamless RLVR integration. We further propose Enigmata-Eval, a rigorous benchmark, and develop optimized multi-task RLVR strategies. Our trained model, Qwen2.5-32B-Enigmata, consistently surpasses o3-mini-high and o1 on the puzzle reasoning benchmarks like Enigmata-Eval, ARC-AGI (32.8%), and ARC-AGI 2 (0.6%). It also generalizes well to out-of-domain puzzle benchmarks and mathematical reasoning, with little multi-tasking trade-off. When trained on larger models like Seed1.5-Thinking (20B activated parameters and 200B total parameters), puzzle data from Enigmata further boosts SoTA performance on advanced math and STEM reasoning tasks such as AIME (2024-2025), BeyondAIME and GPQA (Diamond), showing nice generalization benefits of Enigmata. This work offers a unified, controllable framework for advancing logical reasoning in LLMs. Resources of this work can be found at https://seed-enigmata.github.io.
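
The generator-verifier design described above can be illustrated with a toy task. The puzzle below (find two positions whose numbers add up to a target) and all function names are hypothetical and are not taken from the Enigmata suite; the sketch only shows the two pieces the abstract names: a generator with a difficulty knob and a rule-based verifier that could serve as a verifiable reward.

```python
import random
from itertools import combinations


def generate(difficulty: int, seed: int) -> dict:
    """Produce a puzzle; higher difficulty means a longer list of numbers."""
    rng = random.Random(seed)
    n = difficulty + 2
    numbers = rng.sample(range(1, 10 * n), n)  # n distinct integers
    i, j = rng.sample(range(n), 2)             # guarantees a solution exists
    return {"numbers": numbers, "target": numbers[i] + numbers[j]}


def verify(puzzle: dict, answer) -> bool:
    """Rule-based check: answer must be two distinct valid indices whose
    numbers sum to the target. Malformed answers are simply rejected."""
    numbers, target = puzzle["numbers"], puzzle["target"]
    if not (isinstance(answer, (tuple, list)) and len(answer) == 2):
        return False
    i, j = answer
    if not (isinstance(i, int) and isinstance(j, int)):
        return False
    if i == j or not (0 <= i < len(numbers) and 0 <= j < len(numbers)):
        return False
    return numbers[i] + numbers[j] == target


def brute_force(puzzle: dict):
    """Reference solver, used here only to exercise the verifier."""
    nums = puzzle["numbers"]
    for i, j in combinations(range(len(nums)), 2):
        if nums[i] + nums[j] == puzzle["target"]:
            return (i, j)
    return None


puzzle = generate(difficulty=3, seed=42)
print(puzzle, verify(puzzle, brute_force(puzzle)))
```

Since the verifier returns a plain boolean for any candidate answer, it can score model outputs automatically, and the seed and difficulty arguments give an unlimited, reproducible stream of examples.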

Deconstructing Obfuscation: A four-dimensional framework for evaluating Large Language Models assembly code deobfuscation capabilities

Abstract

arXiv:2505.19887v1 Announce Type: cross Abstract: Large language models (LLMs) have shown promise in software engineering, yet their effectiveness for binary analysis remains unexplored. We present the first comprehensive evaluation of commercial LLMs for assembly code deobfuscation. Testing seven state-of-the-art models against four obfuscation scenarios (bogus control flow, instruction substitution, control flow flattening, and their combination), we found striking performance variations--from autonomous deobfuscation to complete failure. We propose a theoretical framework based on four dimensions: Reasoning Depth, Pattern Recognition, Noise Filtering, and Context Integration, explaining these variations. Our analysis identifies five error patterns: predicate misinterpretation, structural mapping errors, control flow misinterpretation, arithmetic transformation errors, and constant propagation errors, revealing fundamental limitations in LLM code processing. We establish a three-tier resistance model: bogus control flow (low resistance), control flow flattening (moderate resistance), and instruction substitution/combined techniques (high resistance). Universal failure against combined techniques demonstrates that sophisticated obfuscation remains effective against advanced LLMs. Our findings suggest a human-AI collaboration paradigm where LLMs reduce expertise barriers for certain reverse engineering tasks while requiring human guidance for complex deobfuscation. This work provides a foundation for evaluating emerging capabilities and developing resistant obfuscation techniques.

摘要

大语言模型（LLMs）在软件工程领域已展现出潜力，但其在二进制分析中的有效性尚未得到探索。我们首次对商用LLMs在汇编代码反混淆方面进行了全面评估。通过测试七种最先进的模型在四种混淆场景（虚假控制流、指令替换、控制流平坦化及其组合）下的表现，我们发现了显著的性能差异——从自主反混淆到完全失效。我们提出了一个基于四个维度的理论框架：推理深度、模式识别、噪声过滤和上下文整合，以解释这些差异。我们的分析识别出五种错误模式：谓词误解、结构映射错误、控制流误解、算术转换错误和常量传播错误，揭示了LLM代码处理的根本局限性。我们建立了一个三级抗性模型：虚假控制流（低抗性）、控制流平坦化（中等抗性）以及指令替换/组合技术（高抗性）。针对组合技术的普遍失效表明，复杂的混淆技术对先进的LLMs仍然有效。我们的研究结果提出了一种人机协作范式，即LLMs可以降低某些逆向工程任务的专业门槛，同时在复杂的反混淆任务中需要人类指导。这项工作为评估新兴能力和开发抗性混淆技术奠定了基础。

Dynamically Learned Test-Time Model Routing in Language Model Zoos with Service Level Guarantees

Abstract

arXiv:2505.19947v1 Announce Type: cross Abstract: Open-weight LLM zoos provide access to numerous high-quality models, but selecting the appropriate model for specific tasks remains challenging and requires technical expertise. Most users simply want factually correct, safe, and satisfying responses without concerning themselves with model technicalities, while inference service providers prioritize minimizing operating costs. These competing interests are typically mediated through service level agreements (SLAs) that guarantee minimum service quality. We introduce MESS+, a stochastic optimization algorithm for cost-optimal LLM request routing while providing rigorous SLA compliance guarantees. MESS+ learns request satisfaction probabilities of LLMs in real-time as users interact with the system, based on which model selection decisions are made by solving a per-request optimization problem. Our algorithm includes a novel combination of virtual queues and request satisfaction prediction, along with a theoretical analysis of cost optimality and constraint satisfaction. Across a wide range of state-of-the-art LLM benchmarks, MESS+ achieves an average of 2x cost savings compared to existing LLM routing techniques.

摘要

开源权重的大型语言模型（LLM）库提供了众多高质量模型，但为特定任务选择合适的模型仍具有挑战性且需要专业技术。大多数用户仅期望获得事实准确、安全且令人满意的回答，而无需关注模型技术细节；与此同时，推理服务提供商则优先考虑最小化运营成本。这些相互竞争的利益通常通过服务级别协议（SLA）来协调，该协议保障最低服务质量。我们提出MESS+，一种随机优化算法，可在严格保证SLA合规性的同时实现成本最优的LLM请求路由。MESS+通过用户与系统的实时交互学习各LLM的请求满足概率，并基于此通过求解每请求优化问题做出模型选择决策。我们的算法创新性地结合了虚拟队列和请求满足预测机制，并辅以成本最优性与约束满足的理论分析。在涵盖多种最先进LLM基准测试中，MESS+相比现有LLM路由技术平均实现2倍的成本节约。

Learning to Select In-Context Demonstration Preferred by Large Language Model

Abstract

arXiv:2505.19966v1 Announce Type: cross Abstract: In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks during inference using only a few demonstrations. However, ICL performance is highly dependent on the selection of these demonstrations. Recent work explores retrieval-based methods for selecting query-specific demonstrations, but these approaches often rely on surrogate objectives such as metric learning, failing to directly optimize ICL performance. Consequently, they struggle to identify truly beneficial demonstrations. Moreover, their discriminative retrieval paradigm is ineffective when the candidate pool lacks sufficient high-quality demonstrations. To address these challenges, we propose GenICL, a novel generative preference learning framework that leverages LLM feedback to directly optimize demonstration selection for ICL. Experiments on 19 datasets across 11 task categories demonstrate that GenICL outperforms existing methods in selecting the most effective demonstrations, leading to better ICL performance.

摘要

上下文学习（ICL）使大型语言模型（LLMs）能够在推理过程中仅通过少量示例适应新任务。然而，ICL的性能高度依赖于这些示例的选择。近期研究探索了基于检索的方法来选择与查询相关的示例，但这些方法通常依赖于替代目标（如度量学习），未能直接优化ICL性能。因此，它们难以识别真正有益的示例。此外，当候选池缺乏足够高质量示例时，其判别式检索范式效果有限。为解决这些问题，我们提出GenICL——一种新颖的生成式偏好学习框架，利用LLM反馈直接优化ICL的示例选择。在11个任务类别、19个数据集上的实验表明，GenICL在选择最有效示例方面优于现有方法，从而实现了更好的ICL性能。

MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research

Abstract

arXiv:2505.19955v1 Announce Type: cross Abstract: Recent advancements in AI agents have demonstrated their growing potential to drive and support scientific discovery. In this work, we introduce MLR-Bench, a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLM-based reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold capable of completing research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing. Our framework supports both stepwise assessment across these distinct research stages, and end-to-end evaluation of the final research paper. We then use MLR-Bench to evaluate six frontier LLMs and an advanced coding agent, finding that while LLMs are effective at generating coherent ideas and well-structured papers, current coding agents frequently (e.g., in 80% of the cases) produce fabricated or invalidated experimental results--posing a major barrier to scientific reliability. We validate MLR-Judge through human evaluation, showing high agreement with expert reviewers, supporting its potential as a scalable tool for research evaluation. We open-source MLR-Bench to help the community benchmark, diagnose, and improve AI research agents toward trustworthy and transparent scientific discovery.

摘要

人工智能代理的最新进展表明其在推动和支持科学发现方面日益增长的潜力。本研究提出MLR-Bench——一个用于评估AI代理在开放式机器学习研究中表现的综合性基准。该基准包含三个核心组件：(1) 源自NeurIPS、ICLR和ICML研讨会的201项研究任务，涵盖多元机器学习主题；(2) MLR-Judge自动化评估框架，结合基于大语言模型的评审员与精心设计的评审标准来评估研究质量；(3) MLR-Agent模块化代理架构，能通过四个阶段（创意生成、方案制定、实验验证和论文撰写）完成研究任务。我们的框架既支持对这些研究阶段的逐步评估，也支持对最终研究论文的端到端评价。通过MLR-Bench对六个前沿大语言模型和先进编程代理的评估发现：虽然大语言模型能有效生成连贯创意和结构严谨的论文，但当前编程代理在80%的案例中会产生虚构或无效的实验结果——这成为科学可靠性的主要障碍。人工评估验证表明MLR-Judge与专家评审具有高度一致性，证实其作为可扩展研究评估工具的潜力。我们开源MLR-Bench以助力学术界对AI研究代理进行基准测试、问题诊断和改进，从而推动可信透明的科学发现。

The Limits of Preference Data for Post-Training

Abstract

arXiv:2505.19964v1 Announce Type: cross Abstract: Recent progress in strengthening the capabilities of large language models has stemmed from applying reinforcement learning to domains with automatically verifiable outcomes. A key question is whether we can similarly use RL to optimize for outcomes in domains where evaluating outcomes inherently requires human feedback; for example, in tasks like deep research and trip planning, outcome evaluation is qualitative and there are many possible degrees of success. One attractive and scalable modality for collecting human feedback is preference data: ordinal rankings (pairwise or $k$-wise) that indicate, for $k$ given outcomes, which one is preferred. In this work, we study a critical roadblock: preference data fundamentally and significantly limits outcome-based optimization. Even with idealized preference data (infinite, noiseless, and online), the use of ordinal feedback can prevent obtaining even approximately optimal solutions. We formalize this impossibility using voting theory, drawing an analogy between how a model chooses to answer a query and how voters choose a candidate to elect. This indicates that grounded human scoring and algorithmic innovations are necessary for extending the success of RL post-training to domains demanding human feedback. We also explore why these limitations have disproportionately impacted RLHF when it comes to eliciting reasoning behaviors (e.g., backtracking) versus situations where RLHF has been historically successful (e.g., instruction-tuning and safety training), finding that the limitations of preference data primarily suppress RLHF's ability to elicit robust strategies -- a class that encompasses most reasoning behaviors.

摘要

近年来，通过将强化学习应用于具有自动验证结果的领域，大语言模型的能力得到了显著提升。一个关键问题是，我们是否能够类似地利用强化学习来优化那些本质上需要人类反馈才能评估结果的领域；例如，在深度研究和旅行规划等任务中，结果评估是定性的，并且存在多种可能的成功程度。收集人类反馈的一种具有吸引力且可扩展的方式是偏好数据：序数排名（成对或k-wise），用于在给定的k个结果中指示哪个更受青睐。在这项工作中，我们研究了一个关键障碍：偏好数据从根本上且显著地限制了基于结果的优化。即使使用理想化的偏好数据（无限的、无噪声的且在线的），序数反馈的使用也可能阻碍获得近似最优解。我们利用投票理论形式化了这种不可能性，将模型选择回答查询的方式与选民选择候选人的方式进行了类比。这表明，为了将强化学习后训练的成功扩展到需要人类反馈的领域，必须依赖于可靠的人类评分和算法创新。我们还探讨了为什么这些限制在激发推理行为（例如回溯）时对RLHF的影响尤为显著，而在RLHF历史上取得成功的场景（例如指令调整和安全训练）中影响较小，发现偏好数据的限制主要抑制了RLHF激发稳健策略的能力——这类策略涵盖了大多数推理行为。
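
A small numeric example may help show why ordinal feedback can hide the better outcome. This is an illustration invented for this digest, not the paper's voting-theoretic construction: the two responses, the three raters, and their cardinal scores are all hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)


def pairwise_win_rate(a_scores, b_scores):
    """Fraction of raters who would rank A above B in a pairwise comparison."""
    wins = sum(a > b for a, b in zip(a_scores, b_scores))
    return wins / len(a_scores)


# Hypothetical 0-10 quality scores from three raters for two responses.
scores = {
    "A": [6, 6, 0],   # slightly better for two raters, useless for the third
    "B": [5, 5, 10],  # slightly worse for two raters, excellent for the third
}

print("A preferred by", pairwise_win_rate(scores["A"], scores["B"]))
print("mean A:", mean(scores["A"]), "mean B:", mean(scores["B"]))
```

Two of three raters prefer A, so purely ordinal data favors A, yet B has the higher average score. The pairwise labels discard how much each rater gains or loses, which is the kind of information grounded scoring would retain.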

ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving

Abstract

arXiv:2505.20024v1 Announce Type: cross Abstract: Owing to their powerful vision-language reasoning and generalization abilities, multimodal large language models (MLLMs) have garnered significant attention in the field of end-to-end (E2E) autonomous driving. However, their application to closed-loop systems remains underexplored, and current MLLM-based methods have not shown clear superiority to mainstream E2E imitation learning approaches. In this work, we propose ReasonPlan, a novel MLLM fine-tuning framework designed for closed-loop driving through holistic reasoning with a self-supervised Next Scene Prediction task and a supervised Decision Chain-of-Thought process. This dual mechanism encourages the model to align visual representations with actionable driving context, while promoting interpretable and causally grounded decision making. We curate a planning-oriented decision reasoning dataset, namely PDR, comprising 210k diverse and high-quality samples. Our method outperforms the mainstream E2E imitation learning method by a large margin of 19% L2 and 16.1 driving score on the Bench2Drive benchmark. Furthermore, ReasonPlan demonstrates strong zero-shot generalization on the unseen DOS benchmark, highlighting its adaptability in handling zero-shot corner cases. Code and dataset will be available at https://github.com/Liuxueyi/ReasonPlan.

摘要

由于具备强大的视觉-语言推理与泛化能力，多模态大语言模型（MLLMs）在端到端自动驾驶领域获得了广泛关注。然而，其在闭环系统中的应用仍待深入探索，且当前基于MLLM的方法尚未展现出对主流端到端模仿学习方案的明显优势。本研究提出ReasonPlan——一种通过自监督'下一场景预测'任务与监督式'决策思维链'过程进行整体推理的新型MLLM微调框架，专为闭环驾驶设计。该双重机制促使模型将视觉表征与可操作的驾驶语境对齐，同时推动可解释且因果关联的决策制定。我们构建了面向规划的决策推理数据集PDR，包含21万个多样化高质量样本。在Bench2Drive基准测试中，本方法以19%的L2指标和16.1的驾驶分数显著超越主流端到端模仿学习方法。此外，ReasonPlan在未见过的DOS基准上展现出强大的零样本泛化能力，突显其处理零样本极端案例的适应性。代码与数据集详见https://github.com/Liuxueyi/ReasonPlan。

DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response

Abstract

arXiv:2505.19973v1 Announce Type: cross Abstract: Digital Forensics and Incident Response (DFIR) involves analyzing digital evidence to support legal investigations. Large Language Models (LLMs) offer new opportunities in DFIR tasks such as log analysis and memory forensics, but their susceptibility to errors and hallucinations raises concerns in high-stakes contexts. Despite growing interest, there is no comprehensive benchmark to evaluate LLMs across both theoretical and practical DFIR domains. To address this gap, we present DFIR-Metric, a benchmark with three components: (1) Knowledge Assessment: a set of 700 expert-reviewed multiple-choice questions sourced from industry-standard certifications and official documentation; (2) Realistic Forensic Challenges: 150 CTF-style tasks testing multi-step reasoning and evidence correlation; and (3) Practical Analysis: 500 disk and memory forensics cases from the NIST Computer Forensics Tool Testing Program (CFTT). We evaluated 14 LLMs using DFIR-Metric, analyzing both their accuracy and consistency across trials. We also introduce a new metric, the Task Understanding Score (TUS), designed to more effectively evaluate models in scenarios where they achieve near-zero accuracy. This benchmark offers a rigorous, reproducible foundation for advancing AI in digital forensics. All scripts, artifacts, and results are available on the project website at https://github.com/DFIR-Metric.

摘要

数字取证与事件响应（DFIR）涉及分析数字证据以支持法律调查。大语言模型（LLM）为DFIR任务（如日志分析和内存取证）提供了新机遇，但其易出错和产生幻觉的特性在高风险场景中引发担忧。尽管关注度日益增长，目前仍缺乏全面评估LLM在理论与实际DFIR领域表现的基准。为此，我们提出DFIR-Metric基准，包含三个组成部分：（1）知识评估：700道经专家评审的多选题，源自行业标准认证和官方文档；（2）真实取证挑战：150项CTF风格任务，测试多步推理与证据关联能力；（3）实践分析：500个来自NIST计算机取证工具测试计划（CFTT）的磁盘与内存取证案例。我们使用DFIR-Metric评估了14个LLM，分析其准确性和多次试验的一致性。同时提出新指标——任务理解分数（TUS），用于在模型接近零准确率场景下更有效评估性能。该基准为推进人工智能在数字取证中的应用提供了严谨、可复现的基础。所有脚本、测试材料及结果详见项目网站https://github.com/DFIR-Metric。

SAEs Are Good for Steering -- If You Select the Right Features

Abstract

arXiv:2505.20063v1 Announce Type: cross Abstract: Sparse Autoencoders (SAEs) have been proposed as an unsupervised approach to learn a decomposition of a model's latent space. This enables useful applications such as steering (influencing the output of a model towards a desired concept) without requiring labeled data. Current methods identify SAE features to steer by analyzing the input tokens that activate them. However, recent work has highlighted that activations alone do not fully describe the effect of a feature on the model's output. In this work, we draw a distinction between two types of features: input features, which mainly capture patterns in the model's input, and output features, which have a human-understandable effect on the model's output. We propose input and output scores to characterize and locate these types of features, and show that high values for both scores rarely co-occur in the same features. These findings have practical implications: after filtering out features with low output scores, we obtain 2-3x improvements when steering with SAEs, making them competitive with supervised methods.

摘要

稀疏自编码器（SAE）作为一种无监督学习方法被提出，旨在分解模型的潜在空间。该方法无需标注数据即可实现诸如'引导'（即控制模型输出朝向特定概念）等实用应用。现有方法通过分析激活SAE特征的输入标记来确定引导特征，但近期研究表明，仅凭激活程度无法完整描述特征对模型输出的影响。本研究区分了两类特征：主要捕获模型输入模式的'输入特征'和对模型输出具有人类可理解影响的'输出特征'。我们提出了输入与输出评分体系来表征和定位这两类特征，并证明两种评分同时高值的情况在同一个特征中极为罕见。这些发现具有实际意义：通过过滤输出评分较低的特征，我们在使用SAE进行引导时获得了2-3倍的性能提升，使其达到与监督方法相当的水平。
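The filter-then-steer idea in this abstract can be sketched in a few lines. This is an illustrative sketch only: the abstract does not define how output scores are computed, so they are taken as given numbers here, and `min_output_score`, `alpha`, and both function names are hypothetical.

```python
def select_steering_features(output_scores, min_output_score):
    """Keep indices of SAE features whose (precomputed) output score is high enough.

    How output scores are computed is not given in the abstract; they are
    treated as inputs here. The threshold is a hypothetical parameter.
    """
    return [i for i, s in enumerate(output_scores) if s >= min_output_score]


def steer(hidden, decoder_direction, alpha):
    """Shift a hidden-state vector along one feature's decoder direction."""
    return [h + alpha * d for h, d in zip(hidden, decoder_direction)]


# Toy usage: three features, only two pass the filter.
kept = select_steering_features([0.05, 0.80, 0.40], min_output_score=0.3)
steered = steer([1.0, 2.0], decoder_direction=[0.5, -0.5], alpha=2.0)
```

Adding a scaled decoder direction to the hidden state is a common way to steer with SAE features; whether the paper uses exactly this form is an assumption.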

Correlating instruction-tuning (in multimodal models) with vision-language processing (in the brain)

Abstract

arXiv:2505.20029v1 Announce Type: cross Abstract: Transformer-based language models, though not explicitly trained to mimic brain recordings, have demonstrated surprising alignment with brain activity. Progress in these models, through increased size, instruction-tuning, and multimodality, has led to better representational alignment with neural data. Recently, a new class of instruction-tuned multimodal LLMs (MLLMs) has emerged, showing remarkable zero-shot capabilities in open-ended multimodal vision tasks. However, it is unknown whether MLLMs, when prompted with natural instructions, lead to better brain alignment and effectively capture instruction-specific representations. To address this, we first investigate brain alignment, i.e., measuring the degree of predictivity of neural visual activity using text output response embeddings from MLLMs as participants engage in watching natural scenes. Experiments with 10 different instructions show that MLLMs exhibit significantly better brain alignment than vision-only models and perform comparably to non-instruction-tuned multimodal models like CLIP. We also find that while these MLLMs are effective at generating high-quality responses suitable to the task-specific instructions, not all instructions are relevant for brain alignment. Further, by varying instructions, we make the MLLMs encode instruction-specific visual concepts related to the input image. This analysis shows that MLLMs effectively capture count-related and recognition-related concepts, demonstrating strong alignment with brain activity. Notably, the majority of the explained variance of the brain encoding models is shared between MLLM embeddings of image captioning and other instructions. These results suggest that enhancing MLLMs' ability to capture task-specific information could lead to better differentiation between various types of instructions, thereby improving their precision in predicting brain responses.

摘要

基于Transformer的语言模型虽未经过明确训练以模拟大脑记录，却展现出与大脑活动惊人的一致性。这些模型通过扩大规模、指令微调及多模态融合的进步，实现了与神经数据更好的表征对齐。近期，一类新型指令微调多模态大语言模型（MLLMs）崭露头角，在开放式多模态视觉任务中展现出卓越的零样本能力。然而，当输入自然指令时，MLLMs是否能提升大脑对齐效果并有效捕捉指令特异性表征尚属未知。为此，我们首先研究大脑对齐性——即利用MLLMs的文本输出响应嵌入来预测参与者观看自然场景时的神经视觉活动程度。通过10种不同指令的实验表明，MLLMs的大脑对齐性显著优于纯视觉模型，并与CLIP等非指令微调多模态模型表现相当。我们还发现，虽然这些MLLMs能生成符合任务指令的高质量响应，但并非所有指令都与大脑对齐相关。进一步通过指令调控，我们使MLLMs编码了与输入图像相关的指令特异性视觉概念。分析显示MLLMs能有效捕捉计数相关和识别相关概念，与大脑活动呈现强对齐性。值得注意的是，大脑编码模型的大部分解释方差来自图像描述指令与其他指令的MLLM嵌入共享。这些结果表明，增强MLLMs捕获任务特异性信息的能力可更好区分各类指令，从而提高其预测大脑响应的精确度。

On the Same Page: Dimensions of Perceived Shared Understanding in Human-AI Interaction

Abstract

arXiv:2505.20068v1 Announce Type: cross Abstract: Shared understanding plays a key role in the effective communication in and performance of human-human interactions. With the increasingly common integration of AI into human contexts, the future of personal and workplace interactions will likely see human-AI interaction (HAII) in which the perception of shared understanding is important. Existing literature has addressed the processes and effects of perceived shared understanding (PSU) in human-human interactions, but the construct remains underexplored in HAII. To better understand PSU in HAII, we conducted an online survey to collect user reflections on interactions with a large language model when its understanding of a situation was thought to be similar to or different from the participant's. Through inductive thematic analysis, we identified eight dimensions comprising PSU in human-AI interactions: fluency, aligned operation, fluidity, outcome satisfaction, contextual awareness, lack of humanlike abilities, computational limits, and suspicion.

摘要

共享理解在人际互动的有效沟通与表现中起着关键作用。随着人工智能日益融入人类语境，未来个人与职场互动很可能会涉及人机交互（HAII），其中对共享理解的感知至关重要。现有文献已探讨了人际互动中共享理解的过程与影响，但其在人机交互中的构建机制仍待深入研究。为深入理解人机交互中的共享理解，我们开展了一项在线调查，收集用户与大型语言模型互动时的反馈，重点分析当模型对情境的理解与参与者相似或相异时的感知差异。通过归纳式主题分析，我们提炼出构成人机交互共享理解的八个维度：流畅性、操作一致性、交互自然度、结果满意度、情境感知力、非人类能力缺失、计算局限性以及信任疑虑。

Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion

Abstract

arXiv:2505.20053v1 Announce Type: cross Abstract: Diffusion models have become the mainstream architecture for text-to-image generation, achieving remarkable progress in visual quality and prompt controllability. However, current inference pipelines generally lack interpretable semantic supervision and correction mechanisms throughout the denoising process. Most existing approaches rely solely on post-hoc scoring of the final image, prompt filtering, or heuristic resampling strategies, making them ineffective in providing actionable guidance for correcting the generative trajectory. As a result, models often suffer from object confusion, spatial errors, inaccurate counts, and missing semantic elements, severely compromising prompt-image alignment and image quality. To tackle these challenges, we propose MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD), a novel framework that, for the first time, introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. PPAD performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. The framework supports both inference-only and training-enhanced settings, and performs semantic correction at only a few diffusion steps, offering strong generality and scalability. Extensive experiments demonstrate PPAD's significant improvements.

摘要

扩散模型已成为文本到图像生成的主流架构，在视觉质量和提示可控性方面取得显著进展。然而，当前推理流程普遍缺乏在整个去噪过程中可解释的语义监督与校正机制。现有方法大多仅依赖最终图像的事后评分、提示过滤或启发式重采样策略，无法为生成轨迹校正提供有效指导。这导致模型常出现对象混淆、空间错位、数量不准确及语义元素缺失等问题，严重损害提示-图像对齐与生成质量。针对这些挑战，我们提出多模态大语言模型语义校正乒乓前瞻扩散框架（PPAD），首次在推理过程中引入多模态大语言模型（MLLM）作为语义观察器。该框架实时分析中间生成结果，识别潜在语义不一致性，并将反馈转化为可控信号以主动引导后续去噪步骤。该方案支持纯推理和训练增强两种模式，仅需极少量扩散步骤即可实现语义校正，具有强通用性和可扩展性。大量实验证明了PPAD框架的显著改进效果。

SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety

Abstract

arXiv:2505.20065v1 Announce Type: cross Abstract: As Large Language Models (LLMs) continue to advance and find applications across a growing number of fields, ensuring the safety of LLMs has become increasingly critical. To address safety concerns, recent studies have proposed integrating safety constraints into Reinforcement Learning from Human Feedback (RLHF). However, these approaches tend to be complex, as they encompass complicated procedures in RLHF along with additional steps required by the safety constraints. Inspired by Direct Preference Optimization (DPO), we introduce a new algorithm called SafeDPO, which is designed to directly optimize the safety alignment objective in a single stage of policy learning, without requiring relaxation. SafeDPO introduces only one additional hyperparameter to further enhance safety and requires only minor modifications to standard DPO. As a result, it eliminates the need to fit separate reward and cost models or to sample from the language model during fine-tuning, while still enhancing the safety of LLMs. Finally, we demonstrate that SafeDPO achieves competitive performance compared to state-of-the-art safety alignment algorithms, both in terms of aligning with human preferences and improving safety.

摘要

随着大型语言模型（LLMs）的持续进步及其在日益广泛领域中的应用，确保LLMs的安全性变得愈发关键。为解决安全问题，近期研究提出将安全约束整合至基于人类反馈的强化学习（RLHF）中。然而，这些方法往往较为复杂，因其不仅包含RLHF的繁琐流程，还需额外处理安全约束引入的步骤。受直接偏好优化（DPO）启发，我们提出一种名为SafeDPO的新算法，该算法旨在策略学习的单阶段中直接优化安全对齐目标，无需松弛处理。SafeDPO仅引入一个额外超参数以进一步提升安全性，且仅需对标准DPO进行微小修改。因此，该方法无需单独拟合奖励与成本模型，也无需在微调阶段从语言模型采样，同时仍能增强LLMs的安全性。最后，我们证明SafeDPO在与人类偏好对齐及提升安全性方面，均能达到与最先进安全对齐算法相竞争的性能水平。
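For reference, the standard DPO objective that SafeDPO is said to modify can be written for a single preference pair as below. This is the well-known DPO loss only; the abstract does not describe SafeDPO's safety-specific change or its extra hyperparameter, so neither is reproduced here. The function name and the default `beta` are illustrative choices.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    loss = -log(sigmoid(beta * margin)), where the margin compares how much more
    the policy prefers the chosen response than the reference model does.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(x)) == softplus(-x); computed in a numerically stable form.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# A policy that agrees with the reference on both responses has zero margin.
neutral = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# A policy that raises the chosen response relative to the reference gets a lower loss.
improved = dpo_loss(-4.0, -7.0, -5.0, -7.0)
```

Because the loss depends only on log-probabilities from the policy and a frozen reference model, no separate reward model and no sampling during fine-tuning are needed, which is the property the abstract says SafeDPO retains.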

Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks

Abstract

arXiv:2505.20047v1 Announce Type: cross Abstract: Large language models (LLMs) show remarkable promise for democratizing automated reasoning by generating formal specifications. However, a fundamental tension exists: LLMs are probabilistic, while formal verification demands deterministic guarantees. This paper addresses this epistemological gap by comprehensively investigating failure modes and uncertainty quantification (UQ) in LLM-generated formal artifacts. Our systematic evaluation of five frontier LLMs reveals Satisfiability Modulo Theories (SMT) based autoformalization's domain-specific impact on accuracy (from +34.8% on logical tasks to -44.5% on factual ones), with known UQ techniques like the entropy of token probabilities failing to identify these errors. We introduce a probabilistic context-free grammar (PCFG) framework to model LLM outputs, yielding a refined uncertainty taxonomy. We find uncertainty signals are task-dependent (e.g., grammar entropy for logic, AUROC>0.93). Finally, a lightweight fusion of these signals enables selective verification, drastically reducing errors (14-100%) with minimal abstention, transforming LLM-driven formalization into a reliable engineering discipline.

摘要

大型语言模型（LLMs）通过生成形式化规范展现了推动自动化推理民主化的显著潜力。然而存在一个本质矛盾：LLMs具有概率性特征，而形式化验证需要确定性保证。本文通过全面研究LLM生成形式化制品的故障模式与不确定性量化（UQ），致力于弥合这一认知鸿沟。我们对五种前沿LLMs的系统性评估表明，基于可满足性模理论（SMT）的自动形式化在不同领域对准确率产生差异性影响（逻辑任务提升+34.8%至事实性任务下降44.5%），而现有UQ技术（如标记概率熵）无法识别这些错误。我们提出概率上下文无关文法（PCFG）框架来建模LLM输出，由此构建出细化的不确定性分类体系。研究发现不确定性信号具有任务依赖性（例如逻辑任务中文法熵的AUROC>0.93）。最终，通过轻量级融合这些信号可实现选择性验证，在极低弃用率下显著降低错误率（14-100%），从而将LLM驱动的形式化转变为可靠的工程规范。
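The idea of an entropy-based uncertainty signal driving selective verification can be illustrated with a toy sketch. Everything specific here is an assumption: derivations are represented as lists of production-rule names, the rule names and the threshold are hypothetical, and the paper's actual PCFG metrics and signal fusion are not given in the abstract.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def rule_entropy(rule_counts):
    """Entropy of the empirical distribution over grammar production rules."""
    total = sum(rule_counts.values())
    return entropy(c / total for c in rule_counts.values())


def should_abstain(sampled_derivations, threshold=1.0):
    """Abstain when the rules used across sampled outputs are too spread out.

    Each derivation is a list of production-rule names (hypothetical format).
    """
    counts = Counter(rule for d in sampled_derivations for rule in d)
    return rule_entropy(counts) > threshold


# Consistent samples use the same rule; inconsistent ones use many different rules.
consistent = [["assert->eq"], ["assert->eq"], ["assert->eq"]]
scattered = [["assert->eq"], ["assert->lt"], ["assert->forall"], ["assert->not"]]
```

With these toy inputs the consistent set has zero entropy and is accepted, while the scattered set has entropy ln 4 (about 1.39 nats) and is flagged for abstention.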

Incentivizing Reasoning from Weak Supervision

Abstract

arXiv:2505.20072v1 Announce Type: cross Abstract: Large language models (LLMs) have demonstrated impressive performance on reasoning-intensive tasks, but enhancing their reasoning abilities typically relies on either reinforcement learning (RL) with verifiable signals or supervised fine-tuning (SFT) with high-quality long chain-of-thought (CoT) demonstrations, both of which are expensive. In this paper, we study a novel problem of incentivizing the reasoning capacity of LLMs without expensive high-quality demonstrations and reinforcement learning. We investigate whether the reasoning capabilities of LLMs can be effectively incentivized via supervision from significantly weaker models. We further analyze when and why such weak supervision succeeds in eliciting reasoning abilities in stronger models. Our findings show that supervision from significantly weaker reasoners can substantially improve student reasoning performance, recovering close to 94% of the gains of expensive RL at a fraction of the cost. Experiments across diverse benchmarks and model architectures demonstrate that weak reasoners can effectively incentivize reasoning in stronger student models, consistently improving performance across a wide range of reasoning tasks. Our results suggest that this simple weak-to-strong paradigm is a promising and generalizable alternative to costly methods for incentivizing strong reasoning capabilities at inference-time in LLMs. The code is publicly available at https://github.com/yuanyige/W2SR.

摘要

大型语言模型（LLMs）在推理密集型任务中展现出卓越性能，但提升其推理能力通常依赖于两种昂贵方式：带有可验证信号的强化学习（RL）或基于高质量长链思维（CoT）演示的监督微调（SFT）。本文研究了一个新颖问题——如何在不依赖昂贵高质量演示和强化学习的情况下激励LLMs的推理能力。我们探究了显著弱模型的监督是否能有效激发强模型的推理能力，并深入分析了这种弱监督何时及为何能成功引发强模型的推理潜力。实验结果表明：显著弱推理器的监督能大幅提升学生模型的推理性能，以极低成本恢复了约94%的昂贵RL方法所获增益。跨多种基准测试和模型架构的实验证明，弱推理器能有效激励强学生模型的推理能力，在广泛推理任务中持续提升性能。我们的研究结果表明，这种简单的"弱监督强"范式是一种具有推广潜力的替代方案，能以较低成本在推理阶段激励LLMs的强推理能力。代码已开源于https://github.com/yuanyige/W2SR。

Language-Agnostic Suicidal Risk Detection Using Large Language Models

Abstract

arXiv:2505.20109v1 Announce Type: cross Abstract: Suicidal risk detection in adolescents is a critical challenge, yet existing methods rely on language-specific models, limiting scalability and generalization. This study introduces a novel language-agnostic framework for suicidal risk assessment with large language models (LLMs). We generate Chinese transcripts from speech using an ASR model and then employ LLMs with prompt-based queries to extract suicidal risk-related features from these transcripts. The extracted features are retained in both Chinese and English to enable cross-linguistic analysis and then used to fine-tune corresponding pretrained language models independently. Experimental results show that our method achieves performance comparable to direct fine-tuning with ASR results or to models trained solely on Chinese suicidal risk-related features, demonstrating its potential to overcome language constraints and improve the robustness of suicidal risk assessment.

摘要

青少年自杀风险检测是一项关键挑战，但现有方法依赖特定语言模型，限制了可扩展性和泛化能力。本研究提出了一种新型语言无关框架，利用大语言模型（LLMs）进行自杀风险评估。我们首先通过语音识别（ASR）模型从语音生成中文文本，随后采用基于提示查询的LLMs从这些文本中提取自杀风险相关特征。所提取的特征以中英文双语形式保留以实现跨语言分析，并分别用于独立微调相应的预训练语言模型。实验结果表明，该方法性能与直接使用ASR结果微调的模型或仅基于中文自杀风险特征训练的模型相当，证实了其在突破语言限制、提升自杀风险评估鲁棒性方面的潜力。

AdaTP: Attention-Debiased Token Pruning for Video Large Language Models

Abstract

arXiv:2505.20100v1 Announce Type: cross Abstract: Video Large Language Models (Video LLMs) have achieved remarkable results in video understanding tasks. However, they often suffer from heavy computational overhead due to the large number of visual tokens generated from multiple video frames. Existing visual token compression methods often rely on attention scores from language models as guidance. However, these scores exhibit inherent biases: global bias reflects a tendency to focus on the two ends of the visual token sequence, while local bias leads to an over-concentration on the same spatial positions across different frames. To address the issue of attention bias, we propose Attention-Debiased Token Pruning for Video Large Language Models (AdaTP), a novel token pruning pipeline for Video LLMs. AdaTP integrates two dedicated debiasing modules into the pipeline, targeting global attention bias and local attention bias, respectively. Without the need for additional training, our method significantly reduces the computational overhead of Video LLMs while retaining the performance of vanilla models. Extensive evaluation shows that AdaTP achieves state-of-the-art performance in various commonly used video understanding benchmarks. In particular, on LLaVA-OneVision-7B, AdaTP maintains performance without degradation while using only up to 27.3% FLOPs compared to the vanilla model. Our code will be released soon.

摘要

视频大语言模型（Video LLMs）在视频理解任务中取得了显著成果，但由于多帧视频生成的大量视觉令牌，其常面临沉重的计算开销。现有视觉令牌压缩方法通常依赖语言模型的注意力分数作为指导，然而这些分数存在固有偏差：全局偏差表现为倾向于关注视觉令牌序列的两端，而局部偏差则导致不同帧间相同空间位置的过度集中。为解决注意力偏差问题，我们提出面向视频大语言模型的注意力去偏令牌剪枝框架（AdaTP），该新型令牌剪枝流程包含两个针对性去偏模块，分别处理全局和局部注意力偏差。该方法无需额外训练即可显著降低视频大语言模型的计算开销，同时保持原始模型性能。大量实验表明，AdaTP在多种常用视频理解基准测试中达到最先进水平。特别是在LLaVA-OneVision-7B模型上，AdaTP仅需至多27.3%的浮点运算量即可维持与原始模型相当的性能。相关代码即将公开。

Inference-time Alignment in Continuous Space

Abstract

arXiv:2505.20081v1 Announce Type: cross Abstract: Aligning large language models with human feedback at inference time has received increasing attention due to its flexibility. Existing methods rely on generating multiple responses from the base policy for search using a reward model, which can be considered as searching in a discrete response space. However, these methods struggle to explore informative candidates when the base policy is weak or the candidate set is small, resulting in limited effectiveness. In this paper, to address this problem, we propose Simple Energy Adaptation (SEA), a simple yet effective algorithm for inference-time alignment. In contrast to expensive search over the discrete space, SEA directly adapts original responses from the base policy toward the optimal one via gradient-based sampling in continuous latent space. Specifically, SEA formulates inference as an iterative optimization procedure on an energy function over actions in the continuous space defined by the optimal policy, enabling simple and effective alignment. For instance, despite its simplicity, SEA outperforms the second-best baseline with a relative improvement of up to 77.51% on AdvBench and 16.36% on MATH. Our code is publicly available at https://github.com/yuanyige/SEA

摘要

基于人类反馈在推理阶段对齐大语言模型的方法因其灵活性而日益受到关注。现有方法依赖于基策略生成多个响应，通过奖励模型进行搜索，这可视为在离散响应空间中进行搜索。然而，当基策略较弱或候选集较小时，这些方法难以探索信息量丰富的候选方案，导致效果有限。本文针对该问题提出简单能量适配算法（SEA），这是一种简单高效的推理阶段对齐方法。与离散空间的高成本搜索不同，SEA通过连续潜空间中的梯度采样，直接将基策略的原始响应适配至最优响应。具体而言，SEA将推理过程建模为连续空间中基于最优策略定义的能量函数的迭代优化过程，实现简单有效的对齐。例如，尽管方法简洁，SEA在AdvBench上相对次优基线最高提升77.51%，在MATH数据集上提升16.36%。代码已开源：https://github.com/yuanyige/SEA
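Gradient-based sampling on an energy function is commonly done with Langevin dynamics, which can be sketched on a one-dimensional toy problem. This is not the SEA implementation: the quadratic energy, the scalar state, and all parameter names are assumptions made for illustration, whereas SEA works on continuous representations of model responses.

```python
import math
import random


def langevin_sample(grad_energy, x0, step=0.1, n_steps=200, temperature=1.0, seed=0):
    """Run unadjusted Langevin dynamics: a gradient step on the energy plus noise.

    With temperature=0 the update reduces to plain gradient descent.
    """
    rng = random.Random(seed)
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - step * grad_energy(x) + math.sqrt(2.0 * step * temperature) * noise
    return x


def toy_grad(x):
    """Gradient of the toy energy E(x) = (x - 3)^2 / 2, minimized at x = 3."""
    return x - 3.0


# Noise-free run: converges to the energy minimum.
x_min = langevin_sample(toy_grad, x0=0.0, temperature=0.0)
# Noisy run: draws an approximate sample from exp(-E(x)) around the minimum.
x_sample = langevin_sample(toy_grad, x0=0.0, temperature=1.0)
```

The contrast with discrete search is that the starting point is moved by gradients rather than replaced by a better candidate from a finite set, so a weak starting response can still be improved.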

Large Language Models Meet Knowledge Graphs for Question Answering: Synthesis and Opportunities

Abstract

arXiv:2505.20099v1 Announce Type: cross Abstract: Large language models (LLMs) have demonstrated remarkable performance on question-answering (QA) tasks because of their superior capabilities in natural language understanding and generation. However, LLM-based QA struggles with complex QA tasks due to poor reasoning capacity, outdated knowledge, and hallucinations. Several recent works synthesize LLMs and knowledge graphs (KGs) for QA to address the above challenges. In this survey, we propose a new structured taxonomy that categorizes the methodology of synthesizing LLMs and KGs for QA according to the categories of QA and the KG's role when integrating with LLMs. We systematically survey state-of-the-art advances in synthesizing LLMs and KGs for QA and compare and analyze these approaches in terms of strength, limitations, and KG requirements. We then align the approaches with QA and discuss how these approaches address the main challenges of different complex QA. Finally, we summarize the advancements, evaluation metrics, and benchmark datasets and highlight open challenges and opportunities.

摘要

大语言模型（LLMs）凭借其在自然语言理解与生成方面的卓越能力，在问答（QA）任务中展现出显著性能。然而，由于推理能力不足、知识陈旧以及幻觉问题，基于LLM的QA系统在处理复杂问答任务时仍面临挑战。近期若干研究通过整合LLMs与知识图谱（KGs）来解决上述问题。本综述提出了一种新的结构化分类法，根据QA任务的类别及KG在与LLM整合过程中的作用，对LLM与KG协同用于QA的方法论进行系统归类。我们全面综述了该领域的前沿进展，从方法优势、局限性及KG需求等维度对比分析了现有技术路径。随后，我们将这些方法与各类QA任务对齐，探讨其如何应对不同复杂QA的核心挑战。最后，本文总结了当前技术进展、评估指标与基准数据集，并指出了开放挑战与未来机遇。

MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning

Abstract

arXiv:2505.20096v1 Announce Type: cross Abstract: We present MA-RAG, a Multi-Agent framework for Retrieval-Augmented Generation (RAG) that addresses the inherent ambiguities and reasoning challenges in complex information-seeking tasks. Unlike conventional RAG methods that rely on either end-to-end fine-tuning or isolated component enhancements, MA-RAG orchestrates a collaborative set of specialized AI agents: Planner, Step Definer, Extractor, and QA Agents, to tackle each stage of the RAG pipeline with task-aware reasoning. Ambiguities may arise from underspecified queries, sparse or indirect evidence in retrieved documents, or the need to integrate information scattered across multiple sources. MA-RAG mitigates these challenges by decomposing the problem into subtasks, such as query disambiguation, evidence extraction, and answer synthesis, and dispatching them to dedicated agents equipped with chain-of-thought prompting. These agents communicate intermediate reasoning and progressively refine the retrieval and synthesis process. Our design allows fine-grained control over information flow without any model fine-tuning. Crucially, agents are invoked on demand, enabling a dynamic and efficient workflow that avoids unnecessary computation. This modular and reasoning-driven architecture enables MA-RAG to deliver robust, interpretable results. Experiments on multi-hop and ambiguous QA benchmarks demonstrate that MA-RAG outperforms state-of-the-art training-free baselines and rivals fine-tuned systems, validating the effectiveness of collaborative agent-based reasoning in RAG.

摘要

我们提出MA-RAG，一种用于检索增强生成（RAG）的多智能体框架，旨在解决复杂信息检索任务中固有的模糊性和推理挑战。与传统RAG方法依赖端到端微调或孤立组件增强不同，MA-RAG通过协调一组专业AI智能体（规划器、步骤定义器、提取器和问答智能体）进行任务感知推理，协同处理RAG流程的每个阶段。模糊性可能源于查询定义不明确、检索文档中证据稀疏或间接、以及需要整合分散在多个来源的信息等问题。MA-RAG通过将问题分解为子任务（如查询消歧、证据提取和答案合成）并分配给配备思维链提示的专用智能体来应对这些挑战。这些智能体通过交流中间推理结果，逐步优化检索与合成过程。我们的设计无需模型微调即可实现信息流的细粒度控制。关键的是，智能体按需调用，实现了动态高效的工作流程，避免不必要的计算。这种模块化且基于推理的架构使MA-RAG能够提供稳健、可解释的结果。在多跳和模糊问答基准测试中，MA-RAG优于最先进的无需训练基线系统，并与微调系统性能相当，验证了基于协作智能体推理在RAG中的有效性。

Named Entity Recognition in Historical Italian: The Case of Giacomo Leopardi's Zibaldone

Abstract

arXiv:2505.20113v1 Announce Type: cross Abstract: The increased digitization of the world's textual heritage poses significant challenges for both computer science and literary studies. Overall, there is an urgent need for computational techniques able to adapt to the challenges of historical texts, such as orthographic and spelling variations, fragmentary structure and digitization errors. The rise of large language models (LLMs) has revolutionized natural language processing, suggesting promising applications for Named Entity Recognition (NER) on historical documents. In spite of this, no thorough evaluation has been proposed for Italian texts. This research tries to fill the gap by proposing a new challenging dataset for entity extraction based on a corpus of 19th century scholarly notes, i.e. Giacomo Leopardi's Zibaldone (1898), containing 2,899 references to people, locations and literary works. This dataset was used to carry out reproducible experiments with both domain-specific BERT-based models and state-of-the-art LLMs such as LLaMa3.1. Results show that instruction-tuned models encounter multiple difficulties handling historical humanistic texts, while fine-tuned NER models offer more robust performance even with challenging entity types such as bibliographic references.

摘要

全球文本遗产数字化程度的提升为计算机科学和文学研究领域带来了重大挑战。当前亟需能够适应历史文本特征的计算技术，这些特征包括正字法与拼写变异、碎片化结构以及数字化错误等问题。大语言模型（LLM）的兴起彻底改变了自然语言处理领域，为历史文献中的命名实体识别（NER）提供了潜在应用前景。然而针对意大利语文本尚未开展系统评估研究。本研究通过构建基于19世纪学术笔记（即贾科莫·莱奥帕尔迪《杂记录》1898年版）的实体抽取挑战性数据集来填补这一空白，该数据集包含2,899条人物、地点及文学作品指涉。利用该数据集，我们分别对基于BERT的领域专用模型与LLaMa3.1等前沿大模型进行了可重复实验。结果表明：指令调优模型在处理历史人文文本时存在多重困难，而经过微调的NER模型即便面对书目引用等复杂实体类型仍能提供更稳健的性能表现。

ResSVD: Residual Compensated SVD for Large Language Model Compression

Abstract

arXiv:2505.20112v1 Announce Type: cross Abstract: Large language models (LLMs) have demonstrated impressive capabilities in a wide range of downstream natural language processing tasks. Nevertheless, their considerable sizes and memory demands hinder practical deployment, underscoring the importance of developing efficient compression strategies. Singular value decomposition (SVD) decomposes a matrix into orthogonal components, enabling efficient low-rank approximation. This is particularly suitable for LLM compression, where weight matrices often exhibit significant redundancy. However, current SVD-based methods neglect the residual matrix from truncation, resulting in significant truncation loss. Additionally, compressing all layers of the model results in severe performance degradation. To overcome these limitations, we propose ResSVD, a new post-training SVD-based LLM compression method. Specifically, we leverage the residual matrix generated during the truncation process to reduce truncation loss. Moreover, under a fixed overall compression ratio, we selectively compress the last few layers of the model, which mitigates error propagation and significantly improves the performance of compressed models. Comprehensive evaluations of ResSVD on diverse LLM families and multiple benchmark datasets indicate that ResSVD consistently achieves superior performance over existing counterpart methods, demonstrating its practical effectiveness.

摘要

大语言模型（LLMs）在下游自然语言处理任务中展现出卓越能力，但其庞大的参数量与内存需求制约了实际部署，这凸显了开发高效压缩策略的重要性。奇异值分解（SVD）通过将矩阵分解为正交分量来实现高效低秩近似，尤其适用于权重矩阵普遍存在显著冗余的LLM压缩。然而，现有基于SVD的方法忽略截断后的残差矩阵，导致严重的截断损失。此外，对模型所有层进行压缩会引发显著的性能下降。为克服这些局限，我们提出ResSVD——一种新的基于SVD的LLM训练后压缩方法。具体而言，我们利用截断过程中产生的残差矩阵来降低截断误差；同时，在固定总体压缩率条件下，选择性压缩模型最后几层以抑制误差传播，从而显著提升压缩模型的性能。在多种LLM架构和基准数据集上的综合评估表明，ResSVD始终优于现有同类方法，验证了其实际有效性。
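The mechanics of truncating and then compensating with the residual can be written out for plain, unweighted SVD. This is a sketch under that assumption only: the ranks r_1 and r_2 are illustrative symbols, and the abstract does not state how ResSVD performs its truncation, so the paper's actual formulation may differ.

```latex
% Plain SVD of a weight matrix, singular values in descending order
W = U \Sigma V^\top = \sum_{i} \sigma_i u_i v_i^\top

% Rank-r_1 truncation and the residual it leaves behind
W_{r_1} = \sum_{i \le r_1} \sigma_i u_i v_i^\top ,
\qquad
R = W - W_{r_1} = \sum_{i > r_1} \sigma_i u_i v_i^\top

% Compensate with the rank-r_2 truncation of the residual
\hat{W} = W_{r_1} + R_{r_2} ,
\qquad
R_{r_2} = \sum_{r_1 < i \le r_1 + r_2} \sigma_i u_i v_i^\top

% Squared Frobenius error before and after compensation
\| W - \hat{W} \|_F^2 = \sum_{i > r_1 + r_2} \sigma_i^2
\;\le\;
\sum_{i > r_1} \sigma_i^2 = \| W - W_{r_1} \|_F^2
```

In this unweighted setting the compensated matrix coincides with a single rank-(r_1 + r_2) truncation of W, so the derivation shows only why discarding the residual loses information, not where the method's reported gains come from.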

StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs

Abstract

arXiv:2505.20139v1 Announce Type: cross Abstract: As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce StructEval, a comprehensive benchmark for evaluating LLMs' capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: 1) generation tasks, producing structured output from natural language prompts, and 2) conversion tasks, translating between structured formats. Our benchmark encompasses 18 formats and 44 task types, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps: even state-of-the-art models like o1-mini achieve an average score of only 75.58, with open-source alternatives lagging approximately 10 points behind. We find generation tasks more challenging than conversion tasks, and producing correct visual content more difficult than generating text-only structures.

摘要

随着大型语言模型(LLMs)在软件开发工作流程中的广泛应用，其生成结构化输出的能力变得至关重要。我们提出StructEval——一个全面评估LLMs生成非可渲染(JSON、YAML、CSV)与可渲染(HTML、React、SVG)结构化格式能力的基准测试。与现有基准不同，StructEval通过两种范式系统评估跨多样格式的结构保真度：1)生成任务：从自然语言提示生成结构化输出；2)转换任务：在结构化格式间进行转换。本基准涵盖18种格式和44类任务，并引入格式遵循度和结构正确性的新型评估指标。实验结果表明存在显著性能差距，即使最先进的o1-mini模型平均得分仅75.58，开源替代品落后约10分。研究发现生成任务比转换任务更具挑战性，且生成正确可视化内容比纯文本结构更为困难。
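A check for format adherence and structural correctness on JSON output might look like the sketch below. It is a guess at the general shape of such a metric, not StructEval's scoring code: the function name, the dotted-path convention, and the two-part return value are all hypothetical.

```python
import json


def json_format_score(output_text, required_paths):
    """Return (is_valid_json, fraction_of_required_paths_present).

    required_paths are dotted key paths such as "user.name" (hypothetical convention).
    """
    try:
        data = json.loads(output_text)
    except json.JSONDecodeError:
        # Output that does not parse fails format adherence outright.
        return False, 0.0

    def has_path(obj, path):
        # Walk nested dictionaries one key at a time.
        for key in path.split("."):
            if isinstance(obj, dict) and key in obj:
                obj = obj[key]
            else:
                return False
        return True

    if not required_paths:
        return True, 1.0
    hits = sum(has_path(data, p) for p in required_paths)
    return True, hits / len(required_paths)


# One of the two required keys is present in the model output.
result = json_format_score('{"user": {"name": "Ada"}}', ["user.name", "user.email"])
```

Separating "does it parse" from "does it contain the expected structure" mirrors the abstract's distinction between format adherence and structural correctness.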

THiNK: Can Large Language Models Think-aloud?

Abstract

arXiv:2505.20184v1 Announce Type: cross Abstract: Assessing higher-order thinking skills in large language models (LLMs) remains a fundamental challenge, especially in tasks that go beyond surface-level accuracy. In this work, we propose THiNK (Testing Higher-order Notion of Knowledge), a multi-agent, feedback-driven evaluation framework grounded in Bloom's Taxonomy. THiNK frames reasoning assessment as an iterative task of problem generation, critique, and revision, encouraging LLMs to think-aloud through step-by-step reflection and refinement. This enables a systematic evaluation of both lower-order (e.g., remember, understand) and higher-order (e.g., evaluate, create) thinking skills. We apply THiNK to seven state-of-the-art LLMs and perform a detailed cognitive analysis of their outputs. Results reveal that while models reliably perform well on lower-order categories, they struggle with applying knowledge in realistic contexts and exhibit limited abstraction. Structured feedback loops significantly improve reasoning performance, particularly in higher-order thinking. Qualitative evaluations further confirm that THiNK-guided outputs better align with domain logic and problem structure. Our framework provides a scalable methodology for probing and enhancing LLM reasoning, offering new directions for evaluation grounded in learning science. The code is available at our GitHub repository.

摘要

评估大型语言模型(LLMs)的高阶思维能力仍是一个根本性挑战，特别是在超越表面准确性的任务中。本研究提出THiNK(高阶知识概念测试)——一个基于布鲁姆分类学的多智能体、反馈驱动的评估框架。THiNK将推理评估构建为问题生成、批判与修订的迭代过程，通过逐步反思与改进促使LLMs进行出声思考。该方法可系统评估低阶(如记忆、理解)和高阶(如评价、创造)思维技能。我们对七种最先进的LLMs应用THiNK框架，并对其输出进行详细认知分析。结果表明：虽然模型能可靠完成低阶认知任务，但在现实情境中应用知识时存在困难，且表现出有限的抽象能力。结构化反馈循环显著提升了推理表现，尤其在高阶思维方面。定性评估进一步证实，经THiNK引导的输出更符合领域逻辑与问题结构。本框架代码提供了一种可扩展的方法论，用于探索和增强LLM推理能力，为基于学习科学的评估开辟了新方向，相关代码已发布于GitHub仓库。

Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models

Abstract

arXiv:2505.20152v1 Announce Type: cross Abstract: Benefiting from contrastively trained visual encoders on large-scale natural scene images, Large Multimodal Models (LMMs) have achieved remarkable performance across various visual perception tasks. However, the inherent limitations of contrastive learning upon summarized descriptions fundamentally restrict the capabilities of models in meticulous reasoning, particularly in crucial scenarios of geometric problem-solving. To enhance geometric understanding, we propose a novel hard negative contrastive learning framework for the vision encoder, which combines image-based contrastive learning using generation-based hard negatives created by perturbing diagram generation code, and text-based contrastive learning using rule-based negatives derived from modified geometric descriptions and retrieval-based negatives selected based on caption similarity. We train CLIP using our strong negative learning method, namely MMCLIP (Multimodal Math CLIP), and subsequently train an LMM for geometric problem-solving. Experiments show that our trained model, MMGeoLM, significantly outperforms other open-source models on three geometric reasoning benchmarks. Even with a size of 7B, it can rival powerful closed-source models like GPT-4o. We further study the impact of different negative sample construction methods and the number of negative samples on the geometric reasoning performance of LMM, yielding fruitful conclusions. The code and dataset are available at https://github.com/THU-KEG/MMGeoLM.

摘要

得益于在大规模自然场景图像上通过对比训练获得的视觉编码器，大型多模态模型（LMMs）在各种视觉感知任务中表现出色。然而，对比学习基于概括性描述的固有局限性，从根本上限制了模型在细致推理（尤其是几何问题求解等关键场景）中的能力。为增强几何理解能力，我们提出了一种面向视觉编码器的硬负样本对比学习框架：该方法结合了基于图像的对比学习（通过扰动图表生成代码创建生成式硬负样本）与基于文本的对比学习（使用基于规则修改的几何描述生成负样本，以及基于标题相似度筛选的检索式负样本）。我们采用这种强负样本学习方法训练CLIP模型（称为MMCLIP，即多模态数学CLIP），进而训练用于几何问题求解的LMM。实验表明，我们训练的MMGeoLM模型在三个几何推理基准测试中显著优于其他开源模型。即使仅有70亿参数规模，其性能仍可媲美GPT-4o等强大的闭源模型。我们进一步研究了不同负样本构建方法及负样本数量对LMM几何推理性能的影响，得出了具有启发性的结论。代码与数据集详见https://github.com/THU-KEG/MMGeoLM。
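The effect of hard negatives on a contrastive objective can be seen with the standard InfoNCE loss for one anchor. This is the generic loss, shown as an illustration; the abstract does not give the exact objective used to train MMCLIP, and the function name and default temperature are assumptions.

```python
import math


def info_nce(pos_sim, neg_sims, temperature=0.07):
    """InfoNCE loss for one anchor: -log softmax of the positive among all candidates.

    pos_sim and neg_sims are similarity scores (e.g. cosine similarities).
    """
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# An easy negative (low similarity) contributes little to the loss...
easy = info_nce(0.9, [0.1])
# ...while a hard negative (almost as similar as the positive) contributes a lot.
hard = info_nce(0.9, [0.8])
```

Negatives that are nearly as similar as the positive dominate the denominator, so they produce a larger loss and a stronger training signal, which is the motivation for constructing them deliberately.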

Parameter-Efficient Fine-Tuning with Column Space Projection

Abstract

arXiv:2505.20211v1 Announce Type: cross Abstract: Fine-tuning large language models (LLMs) with minimal computational overhead is essential for efficiently adapting them to downstream tasks under resource constraints. Parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), facilitate this by updating only a small subset of parameters. However, recent studies show that LoRA diverges from full fine-tuning (Full FT) in its learning behavior, particularly in terms of spectral properties. Motivated by these findings, we propose PiCa, the first theoretically grounded PEFT method based on the spectral properties of fine-tuned weights. PiCa projects gradients onto the low-rank column subspace of pre-trained weights and exhibits learning patterns more closely aligned with Full FT. Furthermore, we show that combining PiCa with weight sharing drastically reduces the number of trainable parameters without compromising performance, achieving better performance than LoRA with 13x fewer trainable parameters. Extensive experiments demonstrate that PiCa achieves state-of-the-art performance compared to existing PEFT methods.

摘要

在有限计算资源下高效调整大语言模型（LLMs）以适应下游任务，需实现计算开销最小化的微调。参数高效微调（PEFT）方法（如低秩自适应LoRA）通过仅更新少量参数实现这一目标。然而最新研究表明，LoRA在学习行为（尤其是频谱特性方面）与全参数微调（Full FT）存在差异。基于此发现，我们提出首个基于频谱特性的理论驱动型PEFT方法PiCa：该方法将梯度投影至预训练权重的低秩列子空间，其学习模式与Full FT更为接近。进一步研究表明，PiCa结合权重共享技术可在不损失性能的前提下大幅减少可训练参数量，以比LoRA少13倍的可训练参数取得优于LoRA的性能。大量实验证明，PiCa相较现有PEFT方法达到了最先进的性能水平。

WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models

Abstract

arXiv:2505.20249v1 Announce Type: cross Abstract: Climate change adaptation requires the understanding of disruptive weather impacts on society, where large language models (LLMs) might be applicable. However, their effectiveness is under-explored due to the difficulty of high-quality corpus collection and the lack of available benchmarks. The climate-related events stored in regional newspapers record how communities adapted and recovered from disasters. However, the processing of the original corpus is non-trivial. In this study, we first develop a disruptive weather impact dataset with a four-stage well-crafted construction pipeline. Then, we propose WXImpactBench, the first benchmark for evaluating the capacity of LLMs on disruptive weather impacts. The benchmark involves two evaluation tasks, multi-label classification and ranking-based question answering. Extensive experiments on evaluating a set of LLMs provide first-hand analysis of the challenges in developing disruptive weather impact understanding and climate change adaptation systems. The constructed dataset and the code for the evaluation framework are available to help society protect against vulnerabilities from disasters.

摘要

气候变化适应需要理解极端天气事件对社会的影响，而大型语言模型(LLMs)可能适用于此领域。然而，由于高质量语料收集的困难以及缺乏可用基准，其有效性尚未得到充分探索。地区性报纸中记录的气候相关事件反映了社区如何适应灾害并从灾难中恢复，但原始语料处理并非易事。本研究首先开发了一个包含四阶段精细构建流程的极端天气影响数据集。随后，我们提出了WXImpactBench——首个评估LLMs在极端天气影响理解能力的基准测试。该基准包含两项评估任务：多标签分类和基于排序的问答系统。通过对一系列LLMs开展广泛实验，我们首次系统分析了开发极端天气影响理解与气候变化适应系统所面临的挑战。所构建的数据集及评估框架代码已公开，以助力社会提升灾害防御能力。

Evaluating Large Language Models for Code Review

Abstract

arXiv:2505.20206v1 Announce Type: cross Abstract: Context: Code reviews are crucial for software quality. Recent AI advances have allowed large language models (LLMs) to review and fix code; now, there are tools that perform these reviews. However, their reliability and accuracy have not yet been systematically evaluated. Objective: This study compares different LLMs' performance in detecting code correctness and suggesting improvements. Method: We tested GPT4o and Gemini 2.0 Flash on 492 AI generated code blocks of varying correctness, along with 164 canonical code blocks from the HumanEval benchmark. To simulate the code review task objectively, we expected LLMs to assess code correctness and improve the code if needed. We ran experiments with different configurations and reported on the results. Results: With problem descriptions, GPT4o and Gemini 2.0 Flash correctly classified code correctness 68.50% and 63.89% of the time, respectively, and corrected the code 67.83% and 54.26% of the time for the 492 code blocks of varying correctness. Without problem descriptions, performance declined. The results for the 164 canonical code blocks differed, suggesting that performance depends on the type of code. Conclusion: LLM code reviews can help suggest improvements and assess correctness, but there is a risk of faulty outputs. We propose a process that involves humans, called the "Human in the loop LLM Code Review" to promote knowledge sharing while mitigating the risk of faulty outputs.

摘要

背景：代码审查对软件质量至关重要。随着人工智能的进步，大型语言模型（LLMs）已能够审查和修改代码，目前已有工具执行此类审查。然而，其可靠性和准确性尚未得到系统评估。目的：本研究比较不同LLMs在检测代码正确性和提出改进建议方面的表现。方法：我们在492个不同正确性的AI生成代码块及164个来自HumanEval基准的标准代码块上测试了GPT4o和Gemini 2.0 Flash。为客观模拟代码审查任务，我们要求LLMs评估代码正确性并在必要时改进代码。实验采用不同配置并报告结果。结果：在提供问题描述时，GPT4o和Gemini 2.0 Flash对492个代码块的正确性分类准确率分别为68.50%和63.89%，代码修正成功率分别为67.83%和54.26%。无问题描述时性能下降。164个标准代码块的结果存在差异，表明性能取决于代码类型。结论：LLM代码审查可辅助改进建议和正确性评估，但存在错误输出风险。我们提出一种包含人类的"人在循环LLM代码审查"流程，以促进知识共享同时降低错误输出风险。

From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data

Abstract

arXiv:2505.20166v1 Announce Type: cross Abstract: Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. However, this adaptation process presents two major limitations. First, ALLMs often suffer from catastrophic forgetting, where important textual capabilities such as instruction-following are lost after training on audio data. In some cases, models may even hallucinate sounds that are not present in the input audio, raising concerns about their reliability. Second, achieving cross-modal alignment between audio and language typically relies on large collections of task-specific question-answer pairs for instruction tuning, making the process resource-intensive. To address these issues, we leverage the backbone LLMs from ALLMs to synthesize general-purpose caption-style alignment data. We refer to this process as bootstrapping audio-language alignment via synthetic data generation from backbone LLMs (BALSa). Building on BALSa, we introduce LISTEN (Learning to Identify Sounds Through Extended Negative Samples), a contrastive-like training method designed to improve ALLMs' ability to distinguish between present and absent sounds. We further extend BALSa to multi-audio scenarios, where the model either explains the differences between audio inputs or produces a unified caption that describes them all, thereby enhancing audio-language alignment. Experimental results indicate that our method effectively mitigates audio hallucinations while reliably maintaining strong performance in audio understanding, reasoning, and instruction-following skills. Moreover, incorporating multi-audio training further enhances the model's comprehension and reasoning capabilities. Overall, BALSa offers an efficient and scalable approach to the development of ALLMs.

摘要

音频感知大语言模型（ALLMs）近期在音频输入的理解与处理方面取得显著进展。这些模型通常通过基于文本的大语言模型（LLMs）进行音频相关任务的额外训练而适配获得。然而，该适配过程存在两大主要局限：首先，ALLMs常出现灾难性遗忘现象，即在音频数据训练后丧失指令跟随等关键文本能力，某些情况下模型甚至可能幻听输入音频中不存在的声音，引发可靠性担忧；其次，实现音频与语言的跨模态对齐通常需要大量特定任务的问答对进行指令微调，导致过程资源密集。为解决这些问题，我们利用ALLMs的骨干LLMs合成通用型标题式对齐数据，将此过程称为"通过骨干LLMs合成数据生成的音频-语言对齐引导（BALSa）"。基于BALSa，我们提出LISTEN（通过扩展负样本学习识别声音），这是一种类对比训练方法，旨在提升ALLMs区分存在与不存在声音的能力。我们进一步将BALSa扩展至多音频场景，使模型能解释音频输入间的差异或生成统一描述所有输入的字幕，从而增强音频-语言对齐。实验结果表明，我们的方法有效缓解了音频幻听问题，同时可靠地保持了音频理解、推理和指令跟随能力的强劲表现。此外，引入多音频训练可进一步提升模型的理解与推理能力。总体而言，BALSa为ALLMs的开发提供了一种高效且可扩展的解决方案。

Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning

Abstract

arXiv:2505.20161v1 Announce Type: cross Abstract: Effective generalization in language models depends critically on the diversity of their training data. Yet existing diversity metrics often fall short of this goal, relying on surface-level heuristics that are decoupled from model behavior. This motivates us to ask: What kind of diversity in training data actually drives generalization in language models -- and how can we measure and amplify it? Through large-scale empirical analyses spanning over 300 training runs, carefully controlled for data scale and quality, we show that data diversity can be a strong predictor of generalization in LLM reasoning -- as measured by average model performance on unseen out-of-distribution benchmarks. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. Despite using a small off-the-shelf proxy model for gradients, G-Vendi consistently outperforms alternative measures, achieving strong correlation (Spearman's $\rho \approx 0.9$) with out-of-distribution (OOD) performance on both natural language inference (NLI) and math reasoning tasks. Building on this insight, we present Prismatic Synthesis, a framework for generating diverse synthetic data by targeting underrepresented regions in gradient space. Experimental results show that Prismatic Synthesis consistently improves model performance as we scale synthetic data -- not just on in-distribution tests but across unseen, out-of-distribution benchmarks -- significantly outperforming state-of-the-art models that rely on a data generator 20 times larger than ours. For example, PrismMath-7B, our model distilled from a 32B LLM, outperforms R1-Distill-Qwen-7B -- the same base model trained on proprietary data generated by 671B R1 -- on 6 out of 7 challenging benchmarks.

摘要

语言模型的有效泛化能力关键取决于其训练数据的多样性。然而现有多样性度量方法往往未能实现这一目标，它们依赖于与模型行为脱节的表层启发式指标。这促使我们思考：训练数据中何种多样性真正驱动语言模型的泛化能力？我们又该如何量化和增强这种多样性？通过对300多次训练过程的大规模实证分析（严格控制数据规模与质量），我们发现数据多样性可作为大语言模型推理泛化能力的强预测指标——该泛化能力通过模型在未见过的分布外基准测试中的平均表现来衡量。我们提出G-Vendi这一通过模型诱导梯度熵来量化多样性的指标。尽管采用现成的小型代理模型计算梯度，G-Vendi始终优于其他度量方法，在自然语言推理（NLI）和数学推理任务中与分布外（OOD）性能均呈现强相关性（Spearman's ρ≈0.9）。基于此发现，我们提出棱镜合成框架，通过瞄准梯度空间中低表征区域来生成多样化合成数据。实验结果表明，随着合成数据规模扩大，棱镜合成不仅能提升模型在分布内测试的表现，更能持续改善其在未见过的分布外基准测试中的性能——显著优于依赖比我们数据生成器大20倍的现有最优模型。例如，我们基于32B大模型蒸馏得到的PrismMath-7B，在7项挑战性基准测试中有6项表现优于R1-Distill-Qwen-7B（该基线模型使用671B R1生成的专有数据训练）。
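
The abstract above reports agreement between a diversity metric and OOD performance as a Spearman rank correlation. As a reminder of what that statistic measures, here is a minimal sketch in plain Python. The numbers are toy values invented for illustration and are not taken from the paper, the variable names are hypothetical, and the closed-form formula used here assumes there are no tied values.

```python
def ranks(values):
    """Return 1-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    n = len(xs)
    rank_x, rank_y = ranks(xs), ranks(ys)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Toy data (hypothetical): a diversity score per training set and the
# OOD accuracy of the model trained on it.
diversity_scores = [0.12, 0.35, 0.50, 0.61, 0.80]
ood_accuracy = [41.0, 47.5, 46.0, 55.2, 58.9]

rho = spearman_rho(diversity_scores, ood_accuracy)
print(f"Spearman's rho = {rho:.2f}")
```

Because the statistic depends only on rank order, it is insensitive to the scale of either quantity, which is presumably why it suits comparing a diversity score against benchmark accuracy.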

KnowTrace: Bootstrapping Iterative Retrieval-Augmented Generation with Structured Knowledge Tracing

Abstract

arXiv:2505.20245v1 Announce Type: cross Abstract: Recent advances in retrieval-augmented generation (RAG) furnish large language models (LLMs) with iterative retrievals of relevant information to handle complex multi-hop questions. These methods typically alternate between LLM reasoning and retrieval to accumulate external information into the LLM's context. However, the ever-growing context inherently imposes an increasing burden on the LLM to perceive connections among critical information pieces, with futile reasoning steps further exacerbating this overload issue. In this paper, we present KnowTrace, an elegant RAG framework to (1) mitigate the context overload and (2) bootstrap higher-quality multi-step reasoning. Instead of simply piling the retrieved contents, KnowTrace autonomously traces out desired knowledge triplets to organize a specific knowledge graph relevant to the input question. Such a structured workflow not only empowers the LLM with an intelligible context for inference, but also naturally inspires a reflective mechanism of knowledge backtracing to identify contributive LLM generations as process supervision data for self-bootstrapping. Extensive experiments show that KnowTrace consistently surpasses existing methods across three multi-hop question answering benchmarks, and the bootstrapped version further amplifies the gains.

摘要

检索增强生成（RAG）技术的最新进展为大型语言模型（LLM）提供了迭代检索相关信息的能力，以处理复杂的多跳问题。这些方法通常交替进行LLM推理和检索，将外部信息逐步积累到LLM的上下文中。然而，不断增长的上下文本质上加重了LLM感知关键信息间联系的负担，而无效的推理步骤进一步加剧了这一过载问题。本文提出KnowTrace，一个精巧的RAG框架，旨在（1）缓解上下文过载，（2）引导更高质量的多步推理。KnowTrace并非简单堆砌检索内容，而是自主追踪所需知识三元组，构建与输入问题相关的特定知识图谱。这种结构化工作流不仅为LLM提供了可理解的推理上下文，还自然激发了知识回溯的反思机制，通过识别有贡献的LLM生成作为过程监督数据来实现自我引导。大量实验表明，KnowTrace在三个多跳问答基准测试中始终超越现有方法，而引导版本进一步放大了优势。

DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning

Abstract

arXiv:2505.20241v1 Announce Type: cross Abstract: Reasoning has substantially improved the performance of large language models (LLMs) on complicated tasks. Central to the current reasoning studies, Process Reward Models (PRMs) offer a fine-grained evaluation of intermediate reasoning steps and guide the reasoning process. However, extending PRMs to multimodal large language models (MLLMs) introduces challenges. Since multimodal reasoning covers a wider range of tasks compared to text-only scenarios, the resulting distribution shift from the training to testing sets is more severe, leading to greater generalization difficulty. Training a reliable multimodal PRM, therefore, demands large and diverse datasets to ensure sufficient coverage. However, current multimodal reasoning datasets suffer from a marked quality imbalance, which degrades PRM performance and highlights the need for an effective data selection strategy. To address the issues, we introduce DreamPRM, a domain-reweighted training framework for multimodal PRMs which employs bi-level optimization. In the lower-level optimization, DreamPRM performs fine-tuning on multiple datasets with domain weights, allowing the PRM to prioritize high-quality reasoning signals and alleviating the impact of dataset quality imbalance. In the upper-level optimization, the PRM is evaluated on a separate meta-learning dataset; this feedback updates the domain weights through an aggregation loss function, thereby improving the generalization capability of trained PRM. Extensive experiments on multiple multimodal reasoning benchmarks covering both mathematical and general reasoning show that test-time scaling with DreamPRM consistently improves the performance of state-of-the-art MLLMs. Further comparisons reveal that DreamPRM's domain-reweighting strategy surpasses other data selection methods and yields higher accuracy gains than existing test-time scaling approaches.

摘要

推理能力显著提升了大型语言模型（LLMs）在复杂任务中的表现。当前推理研究的核心——过程奖励模型（PRMs）——能够对中间推理步骤进行细粒度评估并指导推理过程。然而，将PRMs扩展至多模态大型语言模型（MLLMs）面临诸多挑战。由于多模态推理相比纯文本场景覆盖更广泛的任务范围，从训练集到测试集的分布偏移更为严重，导致泛化难度增大。因此，训练可靠的多模态PRM需要大规模多样化数据集以确保充分覆盖，但现有多模态推理数据集存在显著的质量不平衡问题，这会降低PRM性能并凸显有效数据选择策略的必要性。为解决这些问题，我们提出DreamPRM——一种采用双层优化的领域重加权多模态PRM训练框架。在底层优化中，DreamPRM通过领域权重对多数据集进行微调，使PRM能优先学习高质量推理信号，缓解数据集质量不平衡的影响；在顶层优化中，PRM在元学习数据集上进行评估，通过聚合损失函数反馈更新领域权重，从而提升训练后PRM的泛化能力。在涵盖数学与通用推理的多模态推理基准测试中，大量实验表明采用DreamPRM的测试时缩放能持续提升前沿MLLMs的性能。进一步对比显示，DreamPRM的领域重加权策略优于其他数据选择方法，且比现有测试时缩放方法获得更高的准确率提升。

Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs

Abstract

arXiv:2505.20254v1 Announce Type: cross Abstract: Sparse Autoencoders (SAEs) are a prominent tool in mechanistic interpretability (MI) for decomposing neural network activations into interpretable features. However, the aspiration to identify a canonical set of features is challenged by the observed inconsistency of learned SAE features across different training runs, undermining the reliability and efficiency of MI research. This position paper argues that mechanistic interpretability should prioritize feature consistency in SAEs -- the reliable convergence to equivalent feature sets across independent runs. We propose using the Pairwise Dictionary Mean Correlation Coefficient (PW-MCC) as a practical metric to operationalize consistency and demonstrate that high levels are achievable (0.80 for TopK SAEs on LLM activations) with appropriate architectural choices. Our contributions include detailing the benefits of prioritizing consistency; providing theoretical grounding and synthetic validation using a model organism, which verifies PW-MCC as a reliable proxy for ground-truth recovery; and extending these findings to real-world LLM data, where high feature consistency strongly correlates with the semantic similarity of learned feature explanations. We call for a community-wide shift towards systematically measuring feature consistency to foster robust cumulative progress in MI.

摘要

稀疏自编码器（SAEs）是机制可解释性（MI）研究中的关键工具，用于将神经网络激活分解为可解释特征。然而，现有研究发现不同训练周期学得的SAE特征存在不一致性，这对识别规范特征集的构想提出了挑战，并削弱了MI研究的可靠性与效率。本立场论文主张机制可解释性研究应优先关注SAE的特征一致性——即通过独立训练能够稳定收敛到等效特征集的能力。我们提出采用配对字典平均相关系数（PW-MCC）作为量化一致性的实用指标，并通过实验证明：通过合理的架构选择可实现高度一致性（在LLM激活数据上TopK SAE的PW-MCC达0.80）。本文贡献包括：（1）系统阐述特征一致性的优势；（2）基于模型生物体进行理论论证与合成验证，证实PW-MCC可作为真实特征恢复的可靠代理指标；（3）将结论拓展至真实LLM数据，证明高特征一致性与所学特征解释的语义相似性存在强相关性。我们呼吁学界转向系统性测量特征一致性，以推动MI领域形成稳健的累积性进展。
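
The consistency metric described above compares the feature dictionaries learned by two independent SAE runs. The abstract does not give the exact definition of PW-MCC, so the following is only a sketch of a mean-correlation-style score under stated assumptions: features are compared by absolute cosine similarity, each feature in one run is matched one-to-one with a feature in the other, and the matching is found by exhaustive search (practical only for tiny dictionaries; a real implementation would use an assignment solver). All function and variable names are hypothetical.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_mcc(dict_a, dict_b):
    """Mean absolute cosine similarity under the best one-to-one matching.

    Assumption: both dictionaries hold the same number of feature vectors.
    """
    n = len(dict_a)
    if n != len(dict_b):
        raise ValueError("dictionaries must have the same number of features")
    sim = [[abs(cosine(a, b)) for b in dict_b] for a in dict_a]
    # Exhaustive search over matchings: O(n!), fine only for toy sizes.
    best_total = max(
        sum(sim[i][perm[i]] for i in range(n))
        for perm in permutations(range(n))
    )
    return best_total / n


# Toy dictionaries: run_b is run_a with features reordered (perfectly
# consistent); run_c has one feature that drifted away from its counterpart.
run_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
run_b = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
run_c = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 1.0]]

score_same = pairwise_mcc(run_a, run_b)
score_drift = pairwise_mcc(run_a, run_c)
print(score_same, score_drift)
```

The matching step is what makes the score invariant to feature ordering, which matters because independent training runs have no reason to learn features in the same order.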

Lifelong Safety Alignment for Language Models

Abstract

arXiv:2505.20259v1 Announce Type: cross Abstract: LLMs have made impressive progress, but their growing capabilities also expose them to highly flexible jailbreaking attacks designed to bypass safety alignment. While many existing defenses focus on known types of attacks, it is more critical to prepare LLMs for unseen attacks that may arise during deployment. To address this, we propose a lifelong safety alignment framework that enables LLMs to continuously adapt to new and evolving jailbreaking strategies. Our framework introduces a competitive setup between two components: a Meta-Attacker, trained to actively discover novel jailbreaking strategies, and a Defender, trained to resist them. To effectively warm up the Meta-Attacker, we first leverage the GPT-4o API to extract key insights from a large collection of jailbreak-related research papers. Through iterative training, the first iteration Meta-Attacker achieves a 73% attack success rate (ASR) on RR and a 57% transfer ASR on LAT using only single-turn attacks. Meanwhile, the Defender progressively improves its robustness and ultimately reduces the Meta-Attacker's success rate to just 7%, enabling safer and more reliable deployment of LLMs in open-ended environments. The code is available at https://github.com/sail-sg/LifelongSafetyAlignment.

摘要

大语言模型（LLMs）已取得显著进展，但其日益增强的能力也使其面临旨在绕过安全对齐的高度灵活的越狱攻击。虽然现有防御多集中于已知攻击类型，但更关键的是让LLMs能够应对部署时可能出现的未知攻击。为此，我们提出了一种终身安全对齐框架，使LLMs能持续适应新型演变的越狱策略。该框架构建了元攻击者与防御者之间的竞争机制：元攻击者通过训练主动发现新型越狱策略，防御者则训练抵抗这些策略。为有效预热元攻击者，我们首先利用GPT-4o API从大量越狱相关研究论文中提取关键洞见。经过迭代训练，首轮元攻击者仅通过单轮攻击即在RR上实现73%的攻击成功率（ASR），在LAT上达到57%的迁移ASR。同时，防御者逐步提升鲁棒性，最终将元攻击者的成功率降至仅7%，从而在开放环境中实现更安全可靠的LLM部署。代码发布于https://github.com/sail-sg/LifelongSafetyAlignment。

Reasoning LLMs are Wandering Solution Explorers

Abstract

arXiv:2505.20296v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning abilities through test-time computation (TTC) techniques such as chain-of-thought prompting and tree-based reasoning. However, we argue that current reasoning LLMs (RLLMs) lack the ability to systematically explore the solution space. This paper formalizes what constitutes systematic problem solving and identifies common failure modes that reveal reasoning LLMs to be wanderers rather than systematic explorers. Through qualitative and quantitative analysis across multiple state-of-the-art LLMs, we uncover persistent issues: invalid reasoning steps, redundant explorations, hallucinated or unfaithful conclusions, and so on. Our findings suggest that current models' performance can appear to be competent on simple tasks yet degrade sharply as complexity increases. Based on the findings, we advocate for new metrics and tools that evaluate not just final outputs but the structure of the reasoning process itself.

摘要

大语言模型（LLMs）通过思维链提示和基于树的推理等测试时计算（TTC）技术，已展现出令人印象深刻的推理能力。然而，我们认为当前的推理大语言模型（RLLMs）缺乏系统探索解空间的能力。本文形式化了系统性解决问题的构成要素，并识别出常见的失败模式，这些模式揭示了推理大语言模型更像是漫游者而非系统性探索者。通过对多种最先进大语言模型的定性与定量分析，我们发现了持续存在的问题：无效的推理步骤、冗余的探索、幻觉或不可信的结论等。我们的研究结果表明，当前模型在简单任务上的表现可能看似胜任，但随着复杂性增加，其性能会急剧下降。基于这些发现，我们主张采用新的评估指标和工具，不仅要评估最终输出，还要评估推理过程本身的结构。

Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution?

Abstract

arXiv:2505.20295v1 Announce Type: cross Abstract: To reveal when a large language model (LLM) is uncertain about a response, uncertainty quantification commonly produces percentage numbers along with the output. But is this all we can do? We argue that in the output space of LLMs, the space of strings, exist strings expressive enough to summarize the distribution over output strings the LLM deems possible. We lay a foundation for this new avenue of uncertainty explication and present SelfReflect, a theoretically-motivated metric to assess how faithfully a string summarizes an LLM's internal answer distribution. We show that SelfReflect is able to discriminate even subtle differences of candidate summary strings and that it aligns with human judgement, outperforming alternative metrics such as LLM judges and embedding comparisons. With SelfReflect, we investigate a number of self-summarization methods and find that even state-of-the-art reasoning models struggle to explicate their internal uncertainty. But we find that faithful summarizations can be generated by sampling and summarizing. Our metric enables future works towards this universal form of LLM uncertainties.

摘要

为了揭示大型语言模型（LLM）对回答存在不确定性的时刻，不确定性量化通常会在输出时附带百分比数值。但这是否是我们唯一能做的？我们认为，在LLM的输出空间——即字符串的集合中——存在具有足够表达力的字符串，能够概括模型认为可能的输出字符串分布。本文为这一不确定性解释的新途径奠定了理论基础，并提出SelfReflect这一理论驱动的度量标准，用于评估字符串对LLM内部答案分布总结的忠实程度。实验表明，SelfReflect能够区分候选摘要字符串间细微的差异，且与人类判断一致，其表现优于LLM评判和嵌入比较等替代性指标。借助SelfReflect，我们研究了多种自我总结方法，发现即使最先进的推理模型也难以准确阐明其内部不确定性。但我们发现，通过采样与总结可以生成忠实的概括性描述。该度量标准为未来探索LLM不确定性的通用表达形式奠定了基础。

The Coverage Principle: A Framework for Understanding Compositional Generalization

Abstract

arXiv:2505.20278v1 Announce Type: cross Abstract: Large language models excel at pattern matching, yet often fall short in systematic compositional generalization. We propose the coverage principle: a data-centric framework showing that models relying primarily on pattern matching for compositional tasks cannot reliably generalize beyond substituting fragments that yield identical results when used in the same contexts. We demonstrate that this framework has a strong predictive power for the generalization capabilities of Transformers. First, we derive and empirically confirm that the training data required for two-hop generalization grows at least quadratically with the token set size, and the training data efficiency does not improve with 20x parameter scaling. Second, for compositional tasks with path ambiguity where one variable affects the output through multiple computational paths, we show that Transformers learn context-dependent state representations that undermine both performance and interoperability. Third, Chain-of-Thought supervision improves training data efficiency for multi-hop tasks but still struggles with path ambiguity. Finally, we outline a mechanism-based taxonomy that distinguishes three ways neural networks can generalize: structure-based (bounded by coverage), property-based (leveraging algebraic invariances), and shared-operator (through function reuse). This conceptual lens contextualizes our results and highlights where new architectural ideas are needed to achieve systematic compositionality. Overall, the coverage principle provides a unified lens for understanding compositional reasoning, and underscores the need for fundamental architectural or training innovations to achieve truly systematic compositionality.

摘要

大型语言模型擅长模式匹配，但在系统性组合泛化方面往往表现欠佳。我们提出覆盖原则：一个以数据为中心的框架，表明主要依赖模式匹配处理组合任务的模型，其泛化能力仅限于替换那些在相同上下文中产生相同结果的片段。我们证明该框架对Transformer模型的泛化能力具有强大预测力。首先，我们推导并实证验证：实现双跳泛化所需的训练数据量随标记集规模至少呈二次方增长，且20倍的参数缩放无法提升训练数据效率。其次，针对存在路径歧义的组合任务（即单个变量通过多重计算路径影响输出），我们发现Transformer会学习上下文相关的状态表示，这会损害模型性能与互操作性。第三，思维链监督能提升多跳任务的训练数据效率，但仍难以解决路径歧义问题。最后，我们提出基于机制的分类法，区分神经网络实现泛化的三种方式：基于结构的（受覆盖原则约束）、基于属性的（利用代数不变性）和共享运算符的（通过函数复用）。这一概念框架为研究结果提供了理论背景，并指出需要新的架构设计以实现系统性组合能力。总体而言，覆盖原则为理解组合推理提供了统一视角，强调要实现真正的系统性组合性，必须在架构或训练方法上进行根本性创新。

Does quantization affect models' performance on long-context tasks?

Abstract

arXiv:2505.20276v1 Announce Type: cross Abstract: Large language models (LLMs) now support context windows exceeding 128K tokens, but this comes with significant memory requirements and high inference latency. Quantization can mitigate these costs, but may degrade performance. In this work, we present the first systematic evaluation of quantized LLMs on tasks with long-inputs (>64K tokens) and long-form outputs. Our evaluation spans 9.7K test examples, five quantization methods (FP8, GPTQ-int8, AWQ-int4, GPTQ-int4, BNB-nf4), and five models (Llama-3.1 8B and 70B; Qwen-2.5 7B, 32B, and 72B). We find that, on average, 8-bit quantization preserves accuracy (~0.8% drop), whereas 4-bit methods lead to substantial losses, especially for tasks involving long context inputs (drops of up to 59%). This degradation tends to worsen when the input is in a language other than English. Crucially, the effects of quantization depend heavily on the quantization method, model, and task. For instance, while Qwen-2.5 72B remains robust under BNB-nf4, Llama-3.1 70B experiences a 32% performance drop on the same task. These findings highlight the importance of a careful, task-specific evaluation before deploying quantized LLMs, particularly in long-context scenarios and with languages other than English.

摘要

当前大语言模型（LLMs）已支持超过128K标记的上下文窗口，但这带来了显著的内存需求和较高的推理延迟。量化技术可降低这些成本，但可能导致性能下降。本研究首次对长输入（>64K标记）和长输出任务中的量化LLMs进行了系统评估。我们的评估涵盖9.7K个测试样本、五种量化方法（FP8、GPTQ-int8、AWQ-int4、GPTQ-int4、BNB-nf4）及五个模型（Llama-3.1 8B与70B；Qwen-2.5 7B、32B和72B）。研究发现：8位量化平均能保持准确率（下降约0.8%），而4位方法会导致显著损失，尤其涉及长上下文输入的任务（最高下降59%）。当输入语言为非英语时，这种性能退化往往更为严重。关键的是，量化效果高度依赖于量化方法、模型和任务。例如，Qwen-2.5 72B在BNB-nf4下保持稳健，而Llama-3.1 70B在同一任务中性能下降32%。这些发现表明，在部署量化LLMs前，特别是在长上下文场景和非英语语言环境中，必须进行细致的任务特异性评估。
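
To give a feel for why 4-bit formats lose more information than 8-bit ones, the sketch below applies a simple symmetric absmax round-to-nearest quantizer to a handful of toy weights. This is deliberately not any of the methods evaluated in the paper (FP8, GPTQ, AWQ, and BNB-nf4 are all more sophisticated), the weights are invented for illustration, and weight reconstruction error is only a rough proxy that says nothing direct about task accuracy.

```python
def quantize_absmax(values, bits):
    """Symmetric round-to-nearest quantization with one per-tensor scale.

    Returns the dequantized values so the error can be measured directly.
    Assumes at least one value is non-zero.
    """
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [code * scale for code in codes]


def mean_abs_error(original, restored):
    """Average absolute difference between original and restored values."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Toy weights (hypothetical), including one large-magnitude outlier that
# stretches the quantization grid for every other value.
weights = [0.02, -0.51, 0.33, 0.07, -0.12, 0.9, -0.04, 0.25]

err8 = mean_abs_error(weights, quantize_absmax(weights, 8))
err4 = mean_abs_error(weights, quantize_absmax(weights, 4))
print(f"8-bit error: {err8:.5f}, 4-bit error: {err4:.5f}")
```

With only 15 representable levels at 4 bits, small weights collapse to zero or to the nearest coarse step, whereas the 8-bit grid keeps every value within half a step of roughly 0.007.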

MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding

Abstract

arXiv:2505.20298v1 Announce Type: cross Abstract: Manga, or Japanese comics, is a richly multimodal narrative form that blends images and text in complex ways. Teaching large multimodal models (LMMs) to understand such narratives at a human-like level could help manga creators reflect on and refine their stories. To this end, we introduce two benchmarks for multimodal manga understanding: MangaOCR, which targets in-page text recognition, and MangaVQA, a novel benchmark designed to evaluate contextual understanding through visual question answering. MangaVQA consists of 526 high-quality, manually constructed question-answer pairs, enabling reliable evaluation across diverse narrative and visual scenarios. Building on these benchmarks, we develop MangaLMM, a manga-specialized model finetuned from the open-source LMM Qwen2.5-VL to jointly handle both tasks. Through extensive experiments, including comparisons with proprietary models such as GPT-4o and Gemini 2.5, we assess how well LMMs understand manga. Our benchmark and model provide a comprehensive foundation for evaluating and advancing LMMs in the richly narrative domain of manga.

摘要

漫画（日本漫画）是一种高度多模态的叙事形式，以复杂方式融合图像与文本。通过训练大型多模态模型（LMMs）实现类人类水平的漫画叙事理解，可帮助创作者反思并优化其作品。为此，我们提出两个多模态漫画理解基准：针对页内文本识别的MangaOCR，以及通过视觉问答评估上下文理解的新基准MangaVQA。MangaVQA包含526组高质量人工构建的问答对，能在多样化叙事与视觉场景中进行可靠评估。基于这些基准，我们开发了MangaLMM——一个从开源LMM Qwen2.5-VL微调而成的漫画专用模型，可协同处理两项任务。通过包括GPT-4o和Gemini 2.5等专有模型对比在内的广泛实验，我们系统评估了LMMs的漫画理解能力。本研究的基准与模型为漫画这一强叙事性领域中LMMs的评估与推进奠定了全面基础。

A Generative Approach to Credit Prediction with Learnable Prompts for Multi-scale Temporal Representation Learning

Abstract

arXiv:2404.13004v4 Announce Type: replace Abstract: Recent industrial credit scoring models remain heavily reliant on manually tuned statistical learning methods. While deep learning offers promising solutions, its effectiveness is often limited by the complexity of financial data, particularly in long-horizon scenarios. In this work, we propose FinLangNet, which addresses credit scoring by reframing it as the task of generating multi-scale distributions of a user's future behavior. Within this framework, tabular data is transformed into sequential representations, enabling the generation of user embeddings across multiple temporal scales. Inspired by the recent success of prompt-based training in Large Language Models (LLMs), FinLangNet also introduces two types of prompts to model and capture user behavior at both the feature-granularity and user-granularity levels. Experimental results demonstrate that FinLangNet outperforms the online XGBoost benchmark, achieving a 7.2% improvement in KS metric performance and a 9.9% reduction in the relative bad debt rate. Furthermore, FinLangNet exhibits superior performance on public UEA archives, underscoring its scalability and adaptability in time series classification tasks.

摘要

当前工业界的信用评分模型仍严重依赖人工调参的统计学习方法。尽管深度学习提供了有前景的解决方案，但其有效性常受限于金融数据的复杂性，尤其在长周期场景下。本研究提出FinLangNet模型，通过将信用评分重构为用户未来行为多尺度分布的生成任务来解决该问题。在此框架下，表格数据被转化为序列化表示，从而生成跨多时间尺度的用户嵌入表征。受大语言模型（LLMs）中基于提示的训练方法近期成功的启发，FinLangNet还引入两类提示符，分别在特征粒度和用户粒度层面建模与捕捉用户行为。实验结果表明，FinLangNet优于线上XGBoost基准模型，KS指标性能提升7.2%，相对坏账率降低9.9%。此外，该模型在公开UEA档案库上表现出卓越性能，印证了其在时间序列分类任务中的可扩展性与适应性。

Unified Preference Optimization: Language Model Alignment Beyond the Preference Frontier

Abstract

arXiv:2405.17956v4 Announce Type: replace Abstract: For aligning large language models (LLMs), prior work has leveraged reinforcement learning from human feedback (RLHF) or variations of direct preference optimization (DPO). While DPO offers a simpler framework based on maximum likelihood estimation, it compromises on the ability to easily tune language models to maximize auxiliary, non-preferential objectives according to the LLM designer's preferences (e.g., tuning lexical style or minimizing specific kinds of harmful content). Critically, these designer objectives may not be amply human-labeled or represented in available data, align with user preferences, or even be able to be captured tractably by binary preference pairs. To leverage the simplicity and performance of DPO with the generality of RL, we propose a unified approach. Based on a simple decomposition of preference and auxiliary objectives, we allow for tuning LLMs to optimize user and designer preferences without any additional specialized or preference data, computational cost, stability "tweaks", or training instability. The proposed method, Unified Preference Optimization, shows the ability to effectively generalize to user preferences and auxiliary objectives, while preserving or surpassing alignment performance on challenging benchmarks across a range of model sizes.

摘要

在大型语言模型（LLM）对齐领域，先前研究主要采用基于人类反馈的强化学习（RLHF）或直接偏好优化（DPO）的变体方法。尽管DPO通过最大似然估计提供了更简洁的框架，但其在灵活调整语言模型以最大化设计者设定的辅助性非偏好目标（如调整词汇风格或最小化特定有害内容）方面存在局限。关键问题在于，这些设计目标可能缺乏充分的人工标注数据、与用户偏好不一致，或难以通过二元偏好对有效捕捉。为结合DPO的简洁高效与强化学习的通用性，我们提出了一种统一方法。通过分解偏好目标与辅助目标，该方法无需额外专用数据、偏好数据、额外计算开销或稳定性调整，即可实现用户偏好与设计者偏好的联合优化。所提出的统一偏好优化方法在多种模型规模下的基准测试中，既能有效泛化至用户偏好和辅助目标，又保持或超越了现有对齐方法的性能表现。

Algorithmic Language Models with Neurally Compiled Libraries

Abstract

arXiv:2407.04899v2 Announce Type: replace Abstract: Important tasks such as reasoning and planning are fundamentally algorithmic, meaning that solving them robustly requires acquiring true reasoning or planning algorithms, rather than shortcuts. Large Language Models lack true algorithmic ability primarily because of the limitations of neural network optimization algorithms, their optimization data and optimization objective, but also due to architectural inexpressivity. To solve this, our paper proposes augmenting LLMs with a library of fundamental operations and sophisticated differentiable programs, so that common algorithms do not need to be learned from scratch. We add memory, registers, basic operations, and adaptive recurrence to a transformer architecture built on LLaMA3. Then, we define a method for directly compiling algorithms into a differentiable starting library, which is used natively and propagates gradients for optimization. In this preliminary study, we explore the feasibility of augmenting LLaMA3 with a differentiable computer, for instance by fine-tuning small transformers on simple algorithmic tasks with variable computational depth.

摘要

推理和规划等重要任务本质上是算法性的，这意味着要稳健地解决这些问题需要获得真正的推理或规划算法，而非捷径。大型语言模型缺乏真正的算法能力，主要源于神经网络优化算法、优化数据及优化目标的局限性，同时也受架构表达力不足的影响。为此，本文提出通过增强LLMs（大型语言模型）的基础操作库和复杂可微分程序，使常见算法无需从头学习。我们在基于LLaMA3的Transformer架构中增加了内存、寄存器、基本操作和自适应循环机制。随后，我们定义了一种将算法直接编译为可微分初始库的方法，该库可原生使用并通过梯度传播进行优化。在本初步研究中，我们探索了为LLaMA3配备可微分计算机的可行性，例如通过在具有可变计算深度的简单算法任务上微调小型Transformer模型来实现。

Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System

Abstract

arXiv:2410.09403v3 Announce Type: replace Abstract: The rapid advancement of scientific progress requires innovative tools that can accelerate knowledge discovery. Although recent AI methods, particularly large language models (LLMs), have shown promise in tasks such as hypothesis generation and experimental design, they fall short of replicating the collaborative nature of real-world scientific practices, where diverse experts work together in teams to tackle complex problems. To address the limitations, we propose an LLM-based multi-agent system, i.e., Virtual Scientists (VirSci), designed to mimic the teamwork inherent in scientific research. VirSci organizes a team of agents to collaboratively generate, evaluate, and refine research ideas. Through comprehensive experiments, we demonstrate that this multi-agent approach outperforms the state-of-the-art method in producing novel scientific ideas. We further investigate the collaboration mechanisms that contribute to its tendency to produce ideas with higher novelty, offering valuable insights to guide future research and illuminating pathways toward building a robust system for autonomous scientific discovery. The code is available at https://github.com/open-sciencelab/Virtual-Scientists.

摘要

科学研究的快速发展需要能够加速知识发现的创新工具。尽管当前人工智能方法（尤其是大语言模型）在假设生成和实验设计等任务中展现出潜力，但这些方法仍难以复现现实科研实践中多领域专家团队协作解决复杂问题的特性。为突破这一局限，我们提出基于大语言模型的多智能体系统——虚拟科学家（VirSci），该系统旨在模拟科研活动中固有的团队协作机制。VirSci通过组织智能体团队协同生成、评估和完善研究构想。综合实验表明，这种多智能体方法在产生新颖科学构想方面优于现有最优方法。我们进一步探究了促成其产生更高新颖性构想的协作机制，这些发现不仅为未来研究提供了重要指引，也为构建自主科学发现的稳健系统指明了路径。项目代码详见https://github.com/open-sciencelab/Virtual-Scientists。

ChemToolAgent: The Impact of Tools on Language Agents for Chemistry Problem Solving

Abstract

arXiv:2411.07228v3 Announce Type: replace Abstract: To enhance large language models (LLMs) for chemistry problem solving, several LLM-based agents augmented with tools have been proposed, such as ChemCrow and Coscientist. However, their evaluations are narrow in scope, leaving a large gap in understanding the benefits of tools across diverse chemistry tasks. To bridge this gap, we develop ChemToolAgent, an enhanced chemistry agent over ChemCrow, and conduct a comprehensive evaluation of its performance on both specialized chemistry tasks and general chemistry questions. Surprisingly, ChemToolAgent does not consistently outperform its base LLMs without tools. Our error analysis with a chemistry expert suggests that: For specialized chemistry tasks, such as synthesis prediction, we should augment agents with specialized tools; however, for general chemistry questions like those in exams, agents' ability to reason correctly with chemistry knowledge matters more, and tool augmentation does not always help.

摘要

为了提升大语言模型（LLMs）在化学问题解决中的能力，已有多个基于LLM的工具增强型智能体被提出，例如ChemCrow和Coscientist。然而，这些系统的评估范围较为局限，导致我们对于工具在不同化学任务中效益的理解存在较大空白。为填补这一空白，我们开发了ChemToolAgent——一个在ChemCrow基础上改进的化学智能体，并对其在专业化学任务和普通化学问题上的表现进行了全面评估。令人惊讶的是，ChemToolAgent在使用工具的情况下并未持续优于其基础LLMs。通过与化学专家进行的错误分析表明：对于合成预测等专业化学任务，我们应为智能体配备专用工具；但对于考试类普通化学问题，智能体正确运用化学知识进行推理的能力更为关键，工具增强并不总能带来帮助。

P$^2$ Law: Scaling Law for Post-Training After Model Pruning

Abstract

arXiv:2411.10272v3 Announce Type: replace Abstract: Pruning has become a widely adopted technique for reducing the hardware requirements of large language models (LLMs). To recover model performance after pruning, post-training is commonly employed to mitigate the resulting performance degradation. While post-training benefits from larger datasets, once the dataset size is already substantial, increasing the training data provides only limited performance gains. To balance post-training cost and model performance, it is necessary to explore the optimal amount of post-training data. Through extensive experiments on the Llama-3 and Qwen-2.5 series models, pruned using various common pruning methods, we uncover the scaling Law for Post-training after model Pruning, referred to as the P$^2$ Law. This law identifies four key factors for predicting the pruned model's post-training loss: the model size before pruning, the number of post-training tokens, the pruning rate, and the model's loss before pruning. Moreover, the P$^2$ Law can generalize to larger dataset sizes, larger model sizes, and higher pruning rates, offering valuable insights for the post-training of pruned LLMs.

摘要

剪枝已成为降低大语言模型（LLM）硬件需求广泛采用的技术。为恢复剪枝后的模型性能，通常采用训练后处理来缓解性能下降。虽然训练后处理受益于更大规模的数据集，但当数据集规模已足够大时，增加训练数据带来的性能提升有限。为平衡训练后成本与模型性能，需探索最优的训练后数据量。通过对Llama-3和Qwen-2.5系列模型进行大量实验（采用多种常见剪枝方法），我们揭示了模型剪枝后训练后的缩放定律（称为P²定律）。该定律确定了预测剪枝模型训练后损失的四个关键因素：剪枝前的模型规模、训练后标记数量、剪枝率以及剪枝前模型的损失。此外，P²定律可推广至更大规模数据集、更大模型尺寸和更高剪枝率，为剪枝后LLM的训练后处理提供了重要指导。

SaVe-TAG: Semantic-aware Vicinal Risk Minimization for Long-Tailed Text-Attributed Graphs

Abstract

arXiv:2410.16882v3 Announce Type: replace Abstract: Real-world graph data often follows long-tailed distributions, making it difficult for Graph Neural Networks (GNNs) to generalize well across both head and tail classes. Recent advances in Vicinal Risk Minimization (VRM) have shown promise in mitigating class imbalance with numeric interpolation; however, existing approaches largely rely on embedding-space arithmetic, which fails to capture the rich semantics inherent in text-attributed graphs. In this work, we propose our method, SaVe-TAG (Semantic-aware Vicinal Risk Minimization for Long-Tailed Text-Attributed Graphs), a novel VRM framework that leverages Large Language Models (LLMs) to perform text-level interpolation, generating on-manifold, boundary-enriching synthetic samples for minority classes. To mitigate the risk of noisy generation, we introduce a confidence-based edge assignment mechanism that uses graph topology as a natural filter to ensure structural consistency. We provide theoretical justification for our method and conduct extensive experiments on benchmark datasets, showing that our approach consistently outperforms both numeric interpolation and prior long-tailed node classification baselines. Our results highlight the importance of integrating semantic and structural signals for balanced and effective learning on text-attributed graphs.

摘要

现实世界中的图数据往往遵循长尾分布，这使得图神经网络（GNNs）难以在头部和尾部类别上均实现良好的泛化性能。邻近风险最小化（VRM）领域的最新进展表明，数值插值方法在缓解类别不平衡问题方面具有潜力；然而现有方法主要依赖于嵌入空间算术运算，无法捕捉文本属性图中固有的丰富语义信息。本研究提出SaVe-TAG方法（面向长尾文本属性图的语义感知邻近风险最小化框架），这是一种创新的VRM框架，通过利用大语言模型（LLMs）执行文本级插值，为少数类生成流形上的边界增强合成样本。为降低噪声生成风险，我们引入基于置信度的边分配机制，以图拓扑结构作为天然过滤器来确保结构一致性。我们为该方法提供了理论证明，并在基准数据集上进行了广泛实验，结果表明我们的方法始终优于数值插值方法和现有长尾节点分类基线。研究结果凸显了整合语义与结构信号对于文本属性图上实现平衡有效学习的重要性。

BPP-Search: Enhancing Tree of Thought Reasoning for Mathematical Modeling Problem Solving

Abstract

arXiv:2411.17404v4 Announce Type: replace Abstract: LLMs exhibit advanced reasoning capabilities, offering the potential to transform natural language questions into mathematical models. However, existing open-source datasets in the operations research domain lack detailed annotations of the modeling process, such as variable definitions, and focus solely on objective values, which hinders reinforcement learning applications. To address this, we release the StructuredOR dataset, annotated with comprehensive labels that capture the complete mathematical modeling process. We further propose BPP-Search, an algorithm that integrates reinforcement learning into a tree-of-thought structure using Beam search, a Process reward model, and a pairwise Preference algorithm. This approach enables efficient exploration of tree structures, avoiding exhaustive search while improving accuracy. Extensive experiments on the StructuredOR, NL4OPT, and MAMO-ComplexLP datasets show that BPP-Search significantly outperforms state-of-the-art methods. In tree-based reasoning, BPP-Search excels in accuracy and efficiency, enabling faster retrieval of correct solutions. The StructuredOR dataset is available on Huggingface at https://huggingface.co/datasets/LLM4OR/StructuredOR and on GitHub at https://github.com/LLM4OR/StructuredOR.

摘要

大语言模型展现出高阶推理能力，具备将自然语言问题转化为数学模型的潜力。然而现有运筹学领域的开源数据集缺乏对建模过程的详细标注（如变量定义），仅聚焦目标函数值，这制约了强化学习的应用。为此，我们发布结构化运筹数据集StructuredOR，该数据集通过全面标注完整记录了数学建模过程。我们进一步提出BPP-Search算法，该算法通过集成束搜索、过程奖励模型和成对偏好算法，将强化学习融入思维树架构。这种方法能高效探索树状结构，在避免穷举搜索的同时提升准确性。在StructuredOR、NL4OPT和MAMO-ComplexLP数据集上的大量实验表明，BPP-Search显著优于现有最优方法。在树状推理任务中，BPP-Search在准确性和效率方面表现卓越，能更快检索到正确解。StructuredOR数据集已发布于Huggingface平台（https://huggingface.co/datasets/LLM4OR/StructuredOR）和GitHub（https://github.com/LLM4OR/StructuredOR）。
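
As a rough illustration of the beam-search component named in the BPP-Search abstract, the following Python sketch keeps only the top-scoring partial solutions at each depth of a tree. The `expand` and `score` callables, and the toy problem, are hypothetical stand-ins: the abstract does not specify the process reward model or the pairwise preference step, so this is generic beam search and not the authors' algorithm.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[str, ...]  # a partial solution: the sequence of steps taken so far


def beam_search(
    root: State,
    expand: Callable[[State], Sequence[State]],
    score: Callable[[State], float],
    beam_width: int,
    max_depth: int,
) -> State:
    """Generic beam search: keep the `beam_width` best states per tree level."""
    beam: List[State] = [root]
    for _ in range(max_depth):
        # Expand every state currently in the beam by one step.
        candidates = [child for state in beam for child in expand(state)]
        if not candidates:
            break  # nothing left to expand
        # Rank by score (highest first) and prune to the beam width.
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
    return max(beam, key=score)


# Toy usage (hypothetical): build a 3-step sequence over {"a", "b"};
# the stand-in scorer simply counts how many "b" steps were chosen.
def toy_expand(state: State) -> List[State]:
    return [state + ("a",), state + ("b",)]


def toy_score(state: State) -> float:
    return float(state.count("b"))


best = beam_search((), toy_expand, toy_score, beam_width=2, max_depth=3)
print(best)
```

With a beam width of 2, the search scores only four candidates per level instead of enumerating all eight leaves, which is the sense in which beam search avoids exhaustive exploration. In a setting like the one the abstract describes, `score` would presumably be replaced by a learned reward model.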

NanoFlow: Towards Optimal Large Language Model Serving Throughput

Abstract

arXiv:2408.12757v2 Announce Type: replace Abstract: Large Language Models (LLMs) have resulted in a surging demand for planet-scale serving systems, where tens of thousands of GPUs continuously serve hundreds of millions of users. Consequently, throughput has emerged as a key metric that determines serving systems' performance. Due to large model sizes and memory-intensive self-attention, LLM serving has been commonly assumed to be memory-bound. Through a detailed analysis, we show that despite having memory-intensive components, end-to-end LLM serving is compute-bound for most common workloads and LLMs. Alas, most existing serving engines fall short of optimal compute utilization, because the heterogeneous operations that comprise LLM serving--compute, memory, networking--are executed sequentially within a device. We propose NanoFlow, a novel serving framework that exploits intra-device parallelism, which overlaps the usage of heterogeneous resources within a single device. NanoFlow splits inputs into smaller nano-batches and duplicates operations to operate on each portion independently, enabling overlapping. NanoFlow automatically identifies the number, size, ordering, and GPU resource allocation of nano-batches to minimize the execution time, while considering the interference of concurrent operations. We evaluate NanoFlow's end-to-end serving throughput on several popular models such as LLaMA-2-70B, Mixtral 8x7B, and LLaMA-3-8B. With practical workloads, NanoFlow provides a 1.91x throughput boost compared to state-of-the-art serving systems, achieving 50% to 72% of optimal throughput across popular models.

摘要

大型语言模型（LLMs）的兴起导致了对行星级服务系统的激增需求，其中数万个GPU持续为数亿用户提供服务。因此，吞吐量已成为决定服务系统性能的关键指标。由于模型规模庞大且内存密集的自注意力机制，LLM服务通常被假定为内存受限。通过详细分析，我们发现尽管存在内存密集型组件，但对于大多数常见工作负载和LLM模型而言，端到端的LLM服务实际上是计算受限的。然而，现有大多数服务引擎未能实现最优计算利用率，因为构成LLM服务的异构操作（计算、内存、网络）在设备内是顺序执行的。

我们提出NanoFlow，一种利用设备内并行性的新型服务框架，通过在单个设备内重叠使用异构资源来实现性能提升。NanoFlow将输入分割为更小的纳米批次，并通过复制操作使每个部分独立运行以实现重叠处理。该框架自动确定纳米批次的数目、大小、顺序及GPU资源分配，以最小化执行时间，同时考虑并发操作的相互干扰。我们在LLaMA-2-70B、Mixtral 8x7B、LLaMA-3-8B等流行模型上评估了NanoFlow的端到端服务吞吐量。实验表明，在实际工作负载下，NanoFlow相比最先进的服务系统可实现1.91倍的吞吐量提升，在主流模型上达到最优吞吐量的50%至72%。

Better Think with Tables: Tabular Structures Enhance LLM Comprehension for Data-Analytics Requests

Abstract

arXiv:2412.17189v2 Announce Type: replace Abstract: Large Language Models (LLMs) often struggle with data-analytics requests related to information retrieval and data manipulation that frequently arise in real-world scenarios under multiple conditions. In this paper, we introduce Thinking with Tables, where we inject tabular structures into LLMs for data-analytics requests. Through comprehensive evaluations across various request types, we show that providing tabular structures yields a 40.29 percent average performance gain along with better robustness and token efficiency. Through attention-value analysis, we uncover that tables help LLMs better attend to relevant information, explaining these improvements. Beyond tables and text, we evaluate whether (1) blending structuredness within text, such as providing templates or fixing the order of attributes, and (2) other representative structures, such as knowledge graphs and JSON, are helpful. We observe that utilizing tables offers the best balance between efficiency and effectiveness. These advantages remain consistent under increased task complexity and even when all input data cannot be structured. Finally, as data analytics typically relies on structured factual inputs, our text-to-table conversion demonstrates the method's applicability to text-compatible data sources.

摘要

大型语言模型（LLMs）在处理现实场景中多条件下频繁出现的信息检索与数据操作等数据分析请求时往往表现欠佳。本文提出"表格思维"方法，通过将表格结构注入LLMs来处理数据分析请求。经多种请求类型的综合评估表明，提供表格结构可带来40.29%的平均性能提升，并具有更好的鲁棒性和标记效率。通过注意力值分析，我们发现表格能帮助LLMs更好地关注相关信息，从而解释这些改进。除表格和文本外，我们还评估了：（1）在文本中融入结构化（如提供模板或固定属性顺序）；（2）其他代表性结构（如知识图谱和JSON）是否有效。实验表明表格在效率与效果间实现了最佳平衡。这些优势在任务复杂度增加甚至输入数据无法完全结构化时仍保持稳定。最后，鉴于数据分析通常依赖结构化的事实输入，本文提出的文本-表格转换方法证明了该技术对文本兼容数据源的适用性。

Demonstration Selection for In-Context Learning via Reinforcement Learning

Abstract

arXiv:2412.03966v2 Announce Type: replace Abstract: Diversity in demonstration selection is critical for enhancing model generalization by enabling broader coverage of structures and concepts. Constructing appropriate demonstration sets remains a key research challenge. This paper introduces the Relevance-Diversity Enhanced Selection (RDES), an innovative approach that leverages reinforcement learning (RL) frameworks to optimize the selection of diverse reference demonstrations for tasks amenable to in-context learning (ICL), particularly text classification and reasoning, in few-shot prompting scenarios. RDES employs frameworks like Q-learning and a PPO-based variant to dynamically identify demonstrations that maximize both diversity (quantified by label distribution) and relevance to the task objective. This strategy ensures a balanced representation of reference data, leading to improved accuracy and generalization. Through extensive experiments on multiple benchmark datasets, including diverse reasoning tasks, and involving 14 closed-source and open-source LLMs, we demonstrate that RDES significantly enhances performance compared to ten established baselines. Our evaluation includes analysis of performance across varying numbers of demonstrations on selected datasets. Furthermore, we investigate incorporating Chain-of-Thought (CoT) reasoning, which further boosts predictive performance. The results highlight the potential of RL for adaptive demonstration selection and addressing challenges in ICL.

摘要

演示样本的多样性对于提升模型泛化能力至关重要，它能够实现对结构和概念的更广泛覆盖。构建合适的演示集仍是当前研究的关键挑战。本文提出相关性-多样性增强选择方法（RDES），这是一种创新性方案，利用强化学习（RL）框架来优化少样本提示场景中适用于上下文学习（ICL）任务（特别是文本分类与推理任务）的多样化参考样本选择。RDES采用Q学习和基于PPO的变体等框架，动态识别能够同时最大化多样性（通过标签分布量化）和任务目标相关性的演示样本。该策略确保了参考数据的平衡表征，从而提升准确性和泛化能力。通过在多个基准数据集（包括多样化推理任务）上开展的广泛实验，并涉及14个闭源和开源大型语言模型，我们证明RDES相较十种现有基线方法显著提升了性能。评估内容包括对选定数据集在不同演示样本数量下的性能分析。此外，我们研究了思维链（CoT）推理的引入，这进一步提升了预测性能。研究结果凸显了强化学习在自适应演示样本选择及解决ICL挑战方面的潜力。
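
The RDES abstract says that diversity is quantified by label distribution and is balanced against relevance. A minimal Python sketch of one way such a combined reward could look is given below. The entropy-based diversity term, the mean-relevance term, and the `alpha` trade-off weight are all assumptions made for illustration; the abstract does not give the actual reward used by RDES.

```python
import math
from collections import Counter
from typing import Sequence


def label_entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in nats) of the empirical label distribution."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def selection_reward(
    labels: Sequence[str],
    relevances: Sequence[float],
    alpha: float = 0.5,  # hypothetical trade-off weight, not from the paper
) -> float:
    """Blend label diversity with mean relevance for a set of demonstrations."""
    if not labels or len(labels) != len(relevances):
        raise ValueError("need one relevance score per labelled demonstration")
    diversity = label_entropy(labels)
    relevance = sum(relevances) / len(relevances)
    return alpha * diversity + (1 - alpha) * relevance


# Two candidate demonstration sets with identical (made-up) relevance scores:
balanced = selection_reward(["pos", "neg"], [0.9, 0.7])
skewed = selection_reward(["pos", "pos"], [0.9, 0.7])
print(balanced > skewed)
```

Under these assumptions the label-balanced set earns the higher reward, which is the behaviour an RL selector would be trained to favour.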

PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving

Abstract

arXiv:2501.08192v2 Announce Type: replace Abstract: Large language models (LLMs) are typically served from clusters of GPUs/NPUs that consist of a large number of devices. Unfortunately, communication between these devices incurs significant overhead, increasing the inference latency and cost while limiting the scalability. Prior work addressed this issue by overlapping communication with compute, but has severe limitations due to the data dependencies between these operations. In this paper, we propose PRESERVE, a novel framework that prefetches model weights and KV-cache from off-chip HBM memory to the on-chip cache of AI accelerators during the communication operations, which offers various advantages and performance improvements compared to prior methods. Through extensive experiments conducted on commercial AI accelerators, we demonstrate up to 1.6x end-to-end speedup on state-of-the-art, open-source LLMs. Additionally, we perform a design space exploration that identifies the optimal hardware configuration for the proposed method, showing a further 1.25x improvement in performance per cost by selecting the optimal L2 cache size. Our results show that PRESERVE has the potential to mitigate the memory bottlenecks and communication overheads, offering a solution to improve the performance and scalability of LLM inference systems.

摘要

大型语言模型（LLMs）通常部署在由大量设备组成的GPU/NPU集群上运行。然而，这些设备间的通信会产生显著开销，不仅增加了推理延迟和成本，还限制了系统的可扩展性。现有研究尝试通过通信与计算重叠来缓解该问题，但由于操作间的数据依赖性，这类方法存在严重局限性。本文提出PRESERVE框架，该创新方案在通信操作期间将模型权重和KV缓存从片外高带宽内存（HBM）预取至AI加速器的片内缓存，相较于现有方法具有多重优势与性能提升。

通过在商用AI加速器上进行大量实验，我们在最先进的开源LLMs上实现了最高1.6倍的端到端加速。此外，通过设计空间探索确定了该方法的硬件最优配置，结果显示选择最佳L2缓存容量可进一步提升1.25倍的性价比。实验结果表明，PRESERVE能有效缓解内存瓶颈与通信开销，为提升LLM推理系统的性能与可扩展性提供了解决方案。

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Abstract

arXiv:2501.17161v2 Announce Type: replace Abstract: Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.

摘要

监督微调（SFT）与强化学习（RL）是基础模型广泛使用的后训练技术，但其对模型泛化能力的提升机制尚不明确。本文研究了SFT与RL在文本规则变体和视觉变体任务中泛化性与记忆性的差异。我们引入算术推理卡牌游戏GeneralPoints和真实世界导航环境V-IRL，评估两种方法在文本与视觉领域对未见变体的泛化表现。研究表明：RL（尤其是基于结果奖励的训练）能同时泛化至基于规则的文本变体和视觉变体；而SFT倾向于记忆训练数据，在分布外场景中泛化能力受限。进一步分析表明，RL通过提升模型底层视觉识别能力来增强视觉领域的泛化性。尽管RL具有更优的泛化性能，但SFT对有效RL训练仍不可或缺——其能稳定模型输出格式，为后续RL实现性能增益奠定基础。这些发现证实了RL在复杂多模态任务中获取可泛化知识的能力。

Value Compass Leaderboard: A Platform for Fundamental and Validated Evaluation of LLMs Values

Abstract

arXiv:2501.07071v2 Announce Type: replace Abstract: As Large Language Models (LLMs) achieve remarkable breakthroughs, aligning their values with humans has become imperative for their responsible development and customized applications. However, there is still a lack of evaluations of LLMs' values that fulfill three desirable goals. (1) Value Clarification: We expect to clarify the underlying values of LLMs precisely and comprehensively, while current evaluations focus narrowly on safety risks such as bias and toxicity. (2) Evaluation Validity: Existing static, open-source benchmarks are prone to data contamination and quickly become obsolete as LLMs evolve. Additionally, these discriminative evaluations uncover LLMs' knowledge about values, rather than providing valid assessments of LLMs' behavioral conformity to values. (3) Value Pluralism: The pluralistic nature of human values across individuals and cultures is largely ignored in measuring LLMs' value alignment. To address these challenges, we present the Value Compass Leaderboard, with three correspondingly designed modules. It (i) grounds the evaluation on motivationally distinct basic values to clarify LLMs' underlying values from a holistic view; (ii) applies a generative evolving evaluation framework with adaptive test items for evolving LLMs and direct value recognition from behaviors in realistic scenarios; and (iii) proposes a metric that quantifies LLMs' alignment with a specific value as a weighted sum over multiple dimensions, with weights determined by pluralistic values.

摘要

随着大型语言模型（LLMs）取得显著突破，将其价值观与人类对齐已成为负责任开发和定制化应用的关键需求。然而，当前仍缺乏满足三个理想目标的LLMs价值观评估体系：（1）价值澄清：需要精确全面地阐明LLMs的潜在价值观，而现有评估仅狭隘地关注偏见和毒性等安全风险；（2）评估效度：静态开源基准测试易受数据污染影响，且随着LLMs演进快速过时。此外，这些判别式评估仅揭示LLMs对价值观的认知，而非对其行为符合价值观的有效测评；（3）价值多元性：人类价值观在个体与文化间的多元特性在LLMs价值对齐测量中被普遍忽视。针对这些挑战，我们提出包含三个对应模块的'价值指南针排行榜'：其（i）基于动机差异的'基础价值观'构建评估体系，从整体视角澄清LLMs潜在价值观；（ii）采用'生成式演进评估框架'，通过自适应测试项应对LLMs的演进，并基于现实场景行为直接识别价值观；（iii）提出量化指标，将LLMs与特定价值观的对齐度计算为多维度的加权求和，权重由多元价值观确定。
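
The third module of the Value Compass Leaderboard is described as a weighted sum over multiple dimensions. The short Python sketch below shows what such an aggregate might look like. The dimension names, the example numbers, and the choice to normalize the weights so they sum to 1 are assumptions for illustration only; the abstract does not state how the weights are derived or scaled.

```python
from typing import Dict


def alignment_score(
    dimension_scores: Dict[str, float],
    weights: Dict[str, float],
) -> float:
    """Weighted sum of per-dimension scores, with weights normalized to sum to 1."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(
        dimension_scores[name] * weight / total_weight
        for name, weight in weights.items()
    )


# Hypothetical per-dimension scores for one model, and hypothetical weights
# standing in for the value priorities of one cultural or individual profile.
scores = {"benevolence": 0.8, "security": 0.5, "achievement": 0.6}
weights = {"benevolence": 2.0, "security": 1.0, "achievement": 1.0}

result = alignment_score(scores, weights)
print(round(result, 3))
```

Changing only the `weights` dictionary re-ranks the same model under a different value profile, which is one way to read the abstract's emphasis on value pluralism.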

AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents

Abstract

arXiv:2410.13825v2 Announce Type: replace Abstract: Autonomy via agents using large language models (LLMs) for personalized, standardized tasks boosts human efficiency. Automating web tasks (like booking hotels within a budget) is increasingly sought after. Fulfilling practical needs, the web agent also serves as an important proof-of-concept example for various agent grounding scenarios, with its success promising advancements in many future applications. Prior research often handcrafts web agent strategies (e.g., prompting templates, multi-agent systems, search methods, etc.) and the corresponding in-context examples, which may not generalize well across all real-world scenarios. On the other hand, there has been limited study on the misalignment between a web agent's observation/action representation and the pre-training data of the LLM it's based on. This discrepancy is especially notable when LLMs are primarily trained for language completion rather than tasks involving embodied navigation actions and symbolic web elements. Our study enhances an LLM-based web agent by simply refining its observation and action space to better align with the LLM's capabilities. This approach enables our base agent to significantly outperform previous methods on a wide variety of web tasks. Specifically, on WebArena, a benchmark featuring general-purpose web interaction tasks, our agent AgentOccam surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points respectively, and boosts the success rate by 26.6 points (+161%) over similar plain web agents with its observation and action space alignment. We achieve this without using in-context examples, new agent roles, online feedback or search strategies. AgentOccam's simple design highlights LLMs' impressive zero-shot performance on web tasks, and underlines the critical role of carefully tuning observation and action spaces for LLM-based agents.

摘要

通过采用大型语言模型（LLM）的智能体实现自主化，可为个性化、标准化任务提升人类效率。自动化网页任务（如在预算内预订酒店）的需求正持续增长。这类网页智能体既能满足实际需求，也可作为多种智能体落地场景的重要概念验证案例，其成功将推动未来诸多应用的进展。现有研究通常采用手工构建的网页智能体策略（如提示模板、多智能体系统、搜索方法等）及对应的上下文示例，但这些方案在现实场景中普遍缺乏泛化性。另一方面，关于网页智能体的观察/动作表示与底层LLM预训练数据之间的错配问题研究甚少。当LLM主要面向语言补全任务训练，而非涉及具身导航动作和符号化网页元素的任务时，这种差异尤为显著。本研究通过优化智能体的观察空间与动作空间以更好匹配LLM能力，从而提升基于LLM的网页智能体性能。该方法使基础智能体在多样化网页任务上显著超越先前方法。具体而言，在通用网页交互任务基准WebArena上，我们的智能体AgentOccam分别以9.8（+29.4%）和5.9（+15.8%）的绝对优势超越先前最优方法和同期工作，并通过观察-动作空间对齐使成功率较同类基础网页智能体提升26.6个点（+161%）。这一成果未使用上下文示例、新增智能体角色、在线反馈或搜索策略。AgentOccam的简洁设计既彰显了LLM在网页任务上的卓越零样本性能，也凸显了精细调校观察与动作空间对基于LLM智能体的关键作用。
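
The AgentOccam abstract reports each improvement as both an absolute gain and a relative gain, which makes it possible to back out the implied baseline scores as a sanity check. The Python sketch below does this arithmetic using only the figures quoted in the abstract; the derived baselines are approximations implied by rounding, not numbers reported by the authors.

```python
def implied_baseline(abs_gain: float, rel_gain_pct: float) -> float:
    """If new = base + abs_gain and abs_gain / base = rel_gain_pct / 100, solve for base."""
    return abs_gain / (rel_gain_pct / 100.0)


# (absolute gain in points, relative gain in percent), as quoted in the abstract
comparisons = {
    "previous state of the art": (9.8, 29.4),
    "concurrent work": (5.9, 15.8),
    "similar plain web agent": (26.6, 161.0),
}

implied_agent_scores = {}
for name, (abs_gain, rel_gain_pct) in comparisons.items():
    base = implied_baseline(abs_gain, rel_gain_pct)
    implied_agent_scores[name] = base + abs_gain
    print(f"{name}: baseline ~{base:.1f}, AgentOccam ~{base + abs_gain:.1f}")
```

All three comparisons imply nearly the same AgentOccam success rate (about 43 points on WebArena), so the quoted absolute and relative figures are mutually consistent.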

Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach

Abstract

arXiv:2502.00577v2 Announce Type: replace Abstract: Multimodal large language models (MLLMs) have shown promising capabilities but struggle under distribution shifts, where evaluation data differ from instruction tuning distributions. Although previous works have provided empirical evaluations, we argue that establishing a formal framework that can characterize and quantify the risk of MLLMs is necessary to ensure the safe and reliable application of MLLMs in the real world. By taking an information-theoretic perspective, we propose the first theoretical framework that enables the quantification of the maximum risk of MLLMs under distribution shifts. Central to our framework is the introduction of Effective Mutual Information (EMI), a principled metric that quantifies the relevance between input queries and model responses. We derive an upper bound for the EMI difference between in-distribution (ID) and out-of-distribution (OOD) data, connecting it to visual and textual distributional discrepancies. Extensive experiments on real benchmark datasets, spanning 61 shift scenarios, empirically validate our theoretical insights.

摘要

多模态大语言模型（MLLMs）已展现出良好的性能，但在分布偏移情况下（即评估数据与指令调优分布存在差异时）表现欠佳。尽管已有研究提供了实证评估，但我们认为有必要建立一个能够表征和量化MLLMs风险的理论框架，以确保其在实际应用中的安全性与可靠性。通过采用信息论视角，我们提出了首个理论框架，用于量化分布偏移下MLLMs的最大风险。该框架的核心是引入有效互信息（EMI），这一原则性指标可量化输入查询与模型响应之间的相关性。我们推导了分布内（ID）与分布外（OOD）数据间EMI差异的上界，并将其与视觉和文本分布差异相关联。在涵盖61种偏移场景的真实基准数据集上的大量实验，实证验证了我们的理论发现。

PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning

Abstract

arXiv:2502.12054v2 Announce Type: replace Abstract: Large language models demonstrate remarkable capabilities across various domains, especially mathematics and logic reasoning. However, current evaluations overlook physics-based reasoning - a complex task requiring physics theorems and constraints. We present PhysReason, a 1,200-problem benchmark comprising knowledge-based (25%) and reasoning-based (75%) problems, where the latter are divided into three difficulty levels (easy, medium, hard). Notably, problems require an average of 8.1 solution steps, with hard requiring 15.6, reflecting the complexity of physics-based reasoning. We propose the Physics Solution Auto Scoring Framework, incorporating efficient answer-level and comprehensive step-level evaluations. Top-performing models like Deepseek-R1, Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on answer-level evaluation, with performance dropping from knowledge questions (75.11%) to hard problems (31.95%). Through step-level evaluation, we identified four key bottlenecks: Physics Theorem Application, Physics Process Understanding, Calculation, and Physics Condition Analysis. These findings position PhysReason as a novel and comprehensive benchmark for evaluating physics-based reasoning capabilities in large language models. Our code and data will be published at https://dxzxy12138.github.io/PhysReason.

When More is Less: Understanding Chain-of-Thought Length in LLMs

Abstract

arXiv:2502.07266v2 Announce Type: replace Abstract: Large Language Models (LLMs) employ Chain-of-Thought (CoT) reasoning to deconstruct complex problems. While longer CoTs are often presumed superior, this paper challenges that notion, arguing that longer is not always better. Drawing on combined evidence from real-world observations, controlled experiments, and theoretical analysis, we demonstrate that task accuracy typically follows an inverted U-shaped curve with CoT length, where performance initially improves but eventually decreases as the number of CoT steps increases. With controlled experiments, we further uncover the scaling behaviors of the optimal CoT length: it increases with task difficulty but decreases with model capability, exposing an inherent simplicity bias where more capable models favor shorter, more efficient CoT reasoning. This bias is also evident in Reinforcement Learning (RL) training, where models gravitate towards shorter CoTs as their accuracy improves. To have a deep understanding of these dynamics, we establish a simple theoretical model that formally proves these phenomena, including the optimal length's scaling laws and the emergence of simplicity bias during RL. Guided by this framework, we demonstrate significant practical benefits from training with optimally-lengthed CoTs and employing length-aware filtering at inference. These findings offer both a principled understanding of the "overthinking" phenomenon and multiple practical guidelines for CoT calibration, enabling LLMs to achieve optimal reasoning performance with adaptive CoTs tailored to task complexity and model capability.

摘要

大型语言模型（LLMs）通过思维链（CoT）推理来解构复杂问题。尽管更长的思维链通常被认为更具优势，但本文挑战了这一观点，指出更长并不总是更好。通过结合现实观察、受控实验和理论分析的综合证据，我们证明任务准确率通常随思维链长度呈现倒U型曲线：随着思维链步骤增加，性能最初提升但最终下降。借助受控实验，我们进一步揭示了最优思维链长度的缩放规律：其随任务难度增加而增长，但随模型能力增强而缩短，这暴露出一种固有的简洁性偏好——能力更强的模型倾向于更短、更高效的思维链推理。这种偏好在强化学习（RL）训练中同样显著，随着模型准确率提升，其会自然趋向更短的思维链。为深入理解这些动态机制，我们建立了一个简易理论模型，严格证明了包括最优长度缩放规律及强化学习中简洁性偏好涌现等现象。在此框架指导下，我们展示了使用最优长度思维链进行训练及在推理时采用长度感知过滤的显著实践效益。这些发现既为"过度思考"现象提供了原理性解释，也为思维链校准提供了多项实用准则，使大型语言模型能通过适配任务复杂度和模型能力的自适应思维链实现最优推理性能。

KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs

Abstract

arXiv:2502.12029v3 Announce Type: replace Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in various complex tasks, yet they still suffer from hallucinations. By incorporating and exploring external knowledge, such as knowledge graphs (KGs), LLMs' ability to provide factual answers has been enhanced. This approach carries significant practical implications. However, existing methods suffer from three key limitations: insufficient mining of LLMs' internal knowledge, constrained generation of interpretable reasoning paths, and unclear fusion of internal and external knowledge. Therefore, we propose KnowPath, a knowledge-enhanced large model framework driven by the collaboration of internal and external knowledge. It relies on the internal knowledge of the LLM to guide the exploration of interpretable directed subgraphs in external knowledge graphs, better integrating the two knowledge sources for more accurate reasoning. Extensive experiments on multiple real-world datasets demonstrate the effectiveness of KnowPath. Our code and data are available at https://github.com/tize-72/KnowPath.

摘要

大语言模型（LLMs）在各类复杂任务中展现出卓越能力，但仍存在幻觉问题。通过引入并探索知识图谱（KGs）等外部知识，LLMs提供事实性回答的能力得到增强，该方法具有重要实践意义。然而现有方法存在三个关键局限：对LLMs内部知识挖掘不足、可解释推理路径生成受限、内外知识融合机制不清晰。为此，我们提出KnowPath——一个由内外知识协同驱动的知识增强大模型框架。该框架依托LLM内部知识引导外部知识图谱中可解释有向子图的探索，更有效地整合两种知识源以实现精准推理。在多个真实数据集上的大量实验验证了KnowPath的有效性。代码与数据详见https://github.com/tize-72/KnowPath。

Automated Knowledge Component Generation and Knowledge Tracing for Coding Problems

Abstract

arXiv:2502.18632v2 Announce Type: replace Abstract: Knowledge components (KCs) mapped to problems help model student learning, tracking their mastery levels on fine-grained skills thereby facilitating personalized learning and feedback in online learning platforms. However, crafting and tagging KCs to problems, traditionally performed by human domain experts, is highly labor-intensive. We present a fully automated, LLM-based pipeline for KC generation and tagging for open-ended programming problems. We also develop an LLM-based knowledge tracing (KT) framework to leverage these LLM-generated KCs, which we refer to as KCGen-KT. We conduct extensive quantitative and qualitative evaluations on a real-world student code submission dataset. We find that KCGen-KT outperforms existing KT methods and human-written KCs on future student response prediction. We investigate the learning curves of generated KCs and show that LLM-generated KCs result in a better fit than human-written KCs under a cognitive model. We also conduct a human evaluation with course instructors to show that our pipeline generates reasonably accurate problem-KC mappings.

摘要

将知识点（KCs）映射到问题中有助于建模学生学习过程，通过追踪学生对细粒度技能的掌握水平，从而促进在线学习平台中的个性化学习与反馈。然而，传统上由领域专家手动完成的知识点创建与问题标注工作极其耗时。我们提出了一种基于大型语言模型（LLM）的全自动流程，用于开放式编程问题的知识点生成与标注。同时，我们开发了一个基于LLM的知识追踪（KT）框架，以利用这些LLM生成的知识点（简称KCGen-KT）。我们在真实世界的学生代码提交数据集上进行了广泛的定量与定性评估。研究发现，KCGen-KT在学生未来答题预测上优于现有知识追踪方法及人工编写的知识点。我们探究了生成知识点的学习曲线，结果表明在认知模型下，LLM生成的知识点比人工编写的知识点具有更好的拟合效果。此外，我们通过课程教师进行人工评估，证实该流程能生成准确度较高的问题-知识点映射关系。

SMART: Self-Aware Agent for Tool Overuse Mitigation

Abstract

arXiv:2502.11435v2 Announce Type: replace Abstract: Current Large Language Model (LLM) agents demonstrate strong reasoning and tool use capabilities, but often lack self-awareness, failing to balance these approaches effectively. This imbalance leads to Tool Overuse, where models unnecessarily rely on external tools for tasks solvable with parametric knowledge, increasing computational overhead. Inspired by human metacognition, we introduce SMART (Strategic Model-Aware Reasoning with Tools), a paradigm that enhances an agent's self-awareness to optimize task handling and reduce tool overuse. To support this paradigm, we introduce SMART-ER, a dataset spanning three domains, where reasoning alternates between parametric knowledge and tool-dependent steps, with each step enriched by rationales explaining when tools are necessary. Through supervised training, we develop SMARTAgent, a family of models that dynamically balance parametric knowledge and tool use. Evaluations show that SMARTAgent reduces tool use by 24% while improving performance by over 37%, enabling 7B-scale models to match their 70B counterparts and GPT-4o. Additionally, SMARTAgent generalizes to out-of-distribution test data like GSM8K and MINTQA, maintaining accuracy with just one-fifth the tool calls. These results highlight the potential of strategic tool use to enhance reasoning, mitigate overuse, and bridge the gap between model size and performance, advancing intelligent and resource-efficient agent designs.

摘要

当前的大型语言模型（LLM）智能体虽展现出强大的推理与工具使用能力，却常缺乏自我认知，难以有效平衡这些方法。这种失衡导致"工具滥用"现象——模型对可通过参数知识解决的任务不必要地依赖外部工具，从而增加计算开销。受人类元认知启发，我们提出SMART（基于策略性模型认知的工具推理）范式，通过增强智能体的自我意识来优化任务处理并减少工具滥用。为支持该范式，我们构建了SMART-ER数据集，涵盖三大领域，其推理过程在参数知识与工具依赖步骤间交替进行，每个步骤均附有解释工具必要性的原理说明。通过监督训练，我们开发出SMARTAgent模型系列，能动态平衡参数知识与工具使用。评估表明，SMARTAgent将工具使用减少24%的同时性能提升超37%，使70亿参数模型达到与700亿参数模型及GPT-4o相当的水平。此外，SMARTAgent在GSM8K和MINTQA等分布外测试数据上展现强泛化能力，仅需五分之一工具调用即可保持准确率。这些成果揭示了策略性工具使用对增强推理、缓解滥用、弥合模型规模与性能差距的潜力，为智能且资源高效的智能体设计提供了新方向。

HPS: Hard Preference Sampling for Human Preference Alignment

Abstract

arXiv:2502.14400v2 Announce Type: replace Abstract: Aligning Large Language Model (LLM) responses with human preferences is vital for building safe and controllable AI systems. While preference optimization methods based on Plackett-Luce (PL) and Bradley-Terry (BT) models have shown promise, they face challenges such as poor handling of harmful content, inefficient use of dispreferred responses, and, specifically for PL, high computational costs. To address these issues, we propose Hard Preference Sampling (HPS), a novel framework for robust and efficient human preference alignment. HPS introduces a training loss that prioritizes the most preferred response while rejecting all dispreferred and harmful ones. It emphasizes "hard" dispreferred responses -- those closely resembling preferred ones -- to enhance the model's rejection capabilities. By leveraging a single-sample Monte Carlo sampling strategy, HPS reduces computational overhead while maintaining alignment quality. Theoretically, HPS improves sample efficiency over existing PL methods and maximizes the reward margin between preferred and dispreferred responses, ensuring clearer distinctions. Experiments on HH-RLHF and PKU-Safety datasets validate HPS's effectiveness, achieving comparable BLEU and reward scores while greatly improving reward margins and thus reducing harmful content generation.

摘要

使大语言模型（LLM）响应与人类偏好对齐对于构建安全可控的AI系统至关重要。尽管基于Plackett-Luce（PL）和Bradley-Terry（BT）模型的偏好优化方法展现出潜力，但仍面临诸多挑战：对有害内容处理不佳、对非偏好响应利用效率低下，以及PL方法特有的高计算成本。为解决这些问题，我们提出硬偏好采样（HPS）框架——一种鲁棒高效的人类偏好对齐新方法。HPS通过引入新型训练损失函数，优先选择最受偏好的响应，同时拒绝所有非偏好及有害响应。该方法特别关注"困难"非偏好响应（即与偏好响应高度相似的样本），以增强模型的拒绝能力。通过采用单样本蒙特卡洛采样策略，HPS在保持对齐质量的同时显著降低计算开销。理论分析表明，HPS较现有PL方法提升了采样效率，并最大化偏好与非偏好响应间的奖励边际，确保更清晰的区分度。在HH-RLHF和PKU-Safety数据集上的实验验证了HPS的有效性：在保持相近BLEU分数和奖励分值的同时，大幅提升奖励边际，从而显著减少有害内容生成。

TheoremExplainAgent: Towards Video-based Multimodal Explanations for LLM Theorem Understanding

Abstract

arXiv:2502.19400v2 Announce Type: replace Abstract: Understanding domain-specific theorems often requires more than just text-based reasoning; effective communication through structured visual explanations is crucial for deeper comprehension. While large language models (LLMs) demonstrate strong performance in text-based theorem reasoning, their ability to generate coherent and pedagogically meaningful visual explanations remains an open challenge. In this work, we introduce TheoremExplainAgent, an agentic approach for generating long-form theorem explanation videos (over 5 minutes) using Manim animations. To systematically evaluate multimodal theorem explanations, we propose TheoremExplainBench, a benchmark covering 240 theorems across multiple STEM disciplines, along with 5 automated evaluation metrics. Our results reveal that agentic planning is essential for generating detailed long-form videos, and the o3-mini agent achieves a success rate of 93.8% and an overall score of 0.77. However, our quantitative and qualitative studies show that most of the videos produced exhibit minor issues with visual element layout. Furthermore, multimodal explanations expose deeper reasoning flaws that text-based explanations fail to reveal, highlighting the importance of multimodal explanations.

摘要

理解特定领域的定理通常不仅需要基于文本的推理；通过结构化视觉解释进行有效沟通对于深入理解至关重要。尽管大型语言模型（LLM）在基于文本的定理推理中表现出色，但其生成连贯且具有教学意义的视觉解释能力仍是一个未解决的挑战。本研究提出TheoremExplainAgent，一种基于代理的方法，利用Manim动画生成长篇定理解释视频（时长超过5分钟）。为系统评估多模态定理解释，我们构建了TheoremExplainBench基准测试，涵盖多个STEM学科的240条定理，并配套5项自动化评估指标。实验结果表明，代理规划对生成长篇详细视频至关重要，其中o3-mini代理的成功率达93.8%，综合得分为0.77。然而，定量与定性研究表明，大部分生成视频在视觉元素布局上存在细微问题。此外，多模态解释能暴露基于文本的解释所无法揭示的深层次推理缺陷，这凸显了多模态解释的重要性。

Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices

Abstract

arXiv:2503.08223v2 Announce Type: replace Abstract: The remarkable success of foundation models has been driven by scaling laws, demonstrating that model performance improves predictably with increased training data and model size. However, this scaling trajectory faces two critical challenges: the depletion of high-quality public data, and the prohibitive computational power required for larger models, which have been monopolized by tech giants. These two bottlenecks pose significant obstacles to the further development of AI. In this position paper, we argue that leveraging massive distributed edge devices can break through these barriers. We reveal the vast untapped potential of data and computational resources on massive edge devices, and review recent technical advancements in distributed/federated learning that make this new paradigm viable. Our analysis suggests that by collaborating on edge devices, everyone can participate in training large language models with small edge devices. This paradigm shift towards distributed training on edge has the potential to democratize AI development and foster a more inclusive AI community.

摘要

基础模型的显著成功源于缩放定律，该定律表明模型性能可随着训练数据和模型规模的增加而可预测地提升。然而，这种扩展路径面临两大关键挑战：高质量公开数据的枯竭，以及更大模型所需的巨额算力（目前被科技巨头垄断）。这两个瓶颈对人工智能的进一步发展构成了重大障碍。在本立场文件中，我们论证利用海量分布式边缘设备可以突破这些壁垒。我们揭示了海量边缘设备上尚未开发的数据与计算资源潜力，并综述了分布式/联邦学习领域的最新技术进步——这些技术使得这一新范式成为可能。分析表明，通过边缘设备协同合作，每个参与者都能使用小型边缘设备参与大语言模型训练。这种向边缘分布式训练的范式转变，有望实现AI发展的民主化，并培育更具包容性的人工智能社区。

ARise: Towards Knowledge-Augmented Reasoning via Risk-Adaptive Search

Abstract

arXiv:2504.10893v2 Announce Type: replace Abstract: Large language models (LLMs) have demonstrated impressive capabilities and are receiving increasing attention to enhance their reasoning through scaling test-time compute. However, their application in open-ended, knowledge-intensive, complex reasoning scenarios is still limited. Reasoning-oriented methods struggle to generalize to open-ended scenarios due to implicit assumptions of complete world knowledge. Meanwhile, knowledge-augmented reasoning (KAR) methods fail to address two core challenges: 1) error propagation, where errors in early steps cascade through the chain, and 2) verification bottleneck, where the explore-exploit tradeoff arises in multi-branch decision processes. To overcome these limitations, we introduce ARise, a novel framework that integrates risk assessment of intermediate reasoning states with dynamic retrieval-augmented generation (RAG) within a Monte Carlo tree search paradigm. This approach enables effective construction and optimization of reasoning plans across multiple maintained hypothesis branches. Experimental results show that ARise significantly outperforms the state-of-the-art KAR methods by up to 23.10%, and the latest RAG-equipped large reasoning models by up to 25.37%. Our project page is at https://opencausalab.github.io/ARise.

摘要

大型语言模型（LLMs）已展现出卓越能力，通过扩展测试时计算来增强其推理性能的研究正受到日益关注。然而，其在开放式、知识密集型复杂推理场景中的应用仍存在局限。面向推理的方法因隐含世界知识完备的假设，难以泛化至开放式场景；而知识增强推理（KAR）方法则面临两个核心挑战：1）错误传播——早期步骤的误差会在推理链中持续累积；2）验证瓶颈——多分支决策过程中探索与利用的权衡问题。为突破这些限制，我们提出ARise框架，该框架将中间推理状态的风险评估与动态检索增强生成（RAG）技术整合至蒙特卡洛树搜索范式，实现对多假设分支推理计划的有效构建与优化。实验结果表明，ARise相较最先进的KAR方法性能提升最高达23.10%，较最新配备RAG的大型推理模型提升最高达25.37%。项目页面详见https://opencausalab.github.io/ARise。

IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery

Abstract

arXiv:2504.16728v2 Announce Type: replace Abstract: The rapid advancement in capabilities of large language models (LLMs) raises a pivotal question: How can LLMs accelerate scientific discovery? This work tackles the crucial first stage of research, generating novel hypotheses. While recent work on automated hypothesis generation focuses on multi-agent frameworks and extending test-time compute, none of the approaches effectively incorporate transparency and steerability through a synergistic Human-in-the-loop (HITL) approach. To address this gap, we introduce IRIS: Interactive Research Ideation System, an open-source platform designed for researchers to leverage LLM-assisted scientific ideation. IRIS incorporates innovative features to enhance ideation, including adaptive test-time compute expansion via Monte Carlo Tree Search (MCTS), a fine-grained feedback mechanism, and query-based literature synthesis. It is designed to empower researchers with greater control and insight throughout the ideation process. We additionally conduct a user study with researchers across diverse disciplines, validating the effectiveness of our system in enhancing ideation. We open-source our code at https://github.com/Anikethh/IRIS-Interactive-Research-Ideation-System

摘要

大语言模型（LLMs）能力的快速提升引发了一个关键问题：LLMs如何加速科学发现？本研究聚焦科研流程的初始阶段——生成新假设。尽管近期关于自动化假设生成的研究主要集中于多智能体框架和扩展测试时计算，但这些方法均未有效整合透明性和可操控性以实现人机协同（HITL）的协同效应。为此，我们提出IRIS：交互式科研构思系统，这是一个开源平台，旨在帮助研究者利用LLM辅助进行科学构思。IRIS整合了多项创新功能以增强构思过程，包括通过蒙特卡洛树搜索（MCTS）实现的自适应测试时计算扩展、细粒度反馈机制以及基于查询的文献综合。该系统设计目标是为研究者提供贯穿整个构思过程的深度掌控与洞察。我们进一步开展了跨学科研究者参与的用户研究，验证了本系统在提升构思效能方面的有效性。代码已开源：https://github.com/Anikethh/IRIS-Interactive-Research-Ideation-System

FamilyTool: A Multi-hop Personalized Tool Use Benchmark

Abstract

arXiv:2504.06766v2 Announce Type: replace Abstract: The integration of tool learning with Large Language Models (LLMs) has expanded their capabilities in handling complex tasks by leveraging external tools. However, existing benchmarks for tool learning inadequately address critical real-world personalized scenarios, particularly those requiring multi-hop reasoning and inductive knowledge adaptation in dynamic environments. To bridge this gap, we introduce FamilyTool, a novel benchmark grounded in a family-based knowledge graph (KG) that simulates personalized, multi-hop tool use scenarios. FamilyTool, including base and extended datasets, challenges LLMs with queries spanning from 1 to 4 relational hops (e.g., inferring familial connections and preferences) and 2 to 6 hops respectively, and incorporates an inductive KG setting where models must adapt to unseen user preferences and relationships without re-training, a common limitation in prior approaches that compromises generalization. We further propose KGETool: a simple KG-augmented evaluation pipeline to systematically assess LLMs' tool use ability in these settings. Experiments reveal significant performance gaps in state-of-the-art LLMs, with accuracy dropping sharply as hop complexity increases and inductive scenarios exposing severe generalization deficits. These findings underscore the limitations of current LLMs in handling personalized, evolving real-world contexts and highlight the urgent need for advancements in tool-learning frameworks. FamilyTool serves as a critical resource for evaluating and advancing LLM agents' reasoning, adaptability, and scalability in complex, dynamic environments. Code and dataset are available at https://github.com/yxzwang/FamilyTool.

摘要

工具学习与大型语言模型（LLMs）的整合通过利用外部工具，扩展了其处理复杂任务的能力。然而，现有的工具学习基准未能充分应对现实世界中关键的个性化场景，特别是那些需要在动态环境中进行多跳推理和归纳知识适应的场景。为弥补这一不足，我们提出了FamilyTool，这是一个基于家族知识图谱（KG）的新型基准，模拟了个性化、多跳工具使用场景。FamilyTool包括基础和扩展数据集，通过1至4个关系跳（例如推断家族关系和偏好）和2至6个关系跳的查询挑战LLMs，并引入了一个归纳KG设置，要求模型在不重新训练的情况下适应未见过的用户偏好和关系，这是先前方法中常见的限制，影响了泛化能力。我们进一步提出了KGETool：一个简单的KG增强评估流程，用于系统评估LLMs在这些设置中的工具使用能力。实验表明，最先进的LLMs存在显著的性能差距，随着跳数复杂度的增加，准确性急剧下降，归纳场景暴露了严重的泛化缺陷。这些发现凸显了当前LLMs在处理个性化、不断演变的现实世界场景中的局限性，并强调了工具学习框架亟需改进的紧迫性。FamilyTool作为评估和推动LLM代理在复杂、动态环境中推理、适应性和可扩展性的关键资源。代码和数据集可在https://github.com/yxzwang/FamilyTool获取。

Toward Reliable Ad-hoc Scientific Information Extraction: A Case Study on Two Materials Datasets

Abstract

arXiv:2406.05348v3 Announce Type: replace-cross Abstract: We explore the ability of GPT-4 to perform ad-hoc schema based information extraction from scientific literature. We assess specifically whether it can, with a basic prompting approach, replicate two existing material science datasets, given the manuscripts from which they were originally manually extracted. We employ materials scientists to perform a detailed manual error analysis to assess where the model struggles to faithfully extract the desired information, and draw on their insights to suggest research directions to address this broadly important task.

摘要

我们探究了GPT-4在基于临时模式的科学文献信息抽取方面的能力。通过基础提示方法，我们重点评估该模型能否在给定原始人工抽取数据的研究论文基础上，复现两个现有的材料科学数据集。我们聘请材料科学家进行详细的人工错误分析，以评估模型在准确抽取目标信息时的不足之处，并基于其专业见解提出研究方向，以解决这一具有广泛重要性的任务。

JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs

Abstract

arXiv:2402.05668v3 Announce Type: replace-cross Abstract: Jailbreak attacks aim to bypass LLMs' safeguards. While researchers have proposed different jailbreak attacks in depth, they have done so in isolation -- either with unaligned settings or comparing a limited range of methods. To fill this gap, we present a large-scale evaluation of various jailbreak attacks. We collect 17 representative jailbreak attacks, summarize their features, and establish a novel jailbreak attack taxonomy. Then we conduct comprehensive measurement and ablation studies across nine aligned LLMs on 160 forbidden questions from 16 violation categories. We also test jailbreak attacks under eight advanced defenses. Based on our taxonomy and experiments, we identify some important patterns; for example, heuristic-based attacks can achieve high attack success rates but are easily mitigated by defenses, which makes them impractical. Our study offers valuable insights for future research on jailbreak attacks and defenses. We hope our work can help the community avoid incremental work and serve as an effective benchmark tool for practitioners.

摘要

越狱攻击旨在绕过大型语言模型的安全防护机制。尽管研究者已从不同角度深入提出多种越狱攻击方法，但这些研究往往孤立进行——要么在未对齐的设置下展开，要么仅对比有限范围的攻击方法。为填补这一空白，本研究对各类越狱攻击进行了大规模评估。我们收集了17种代表性越狱攻击方法，总结其核心特征并建立了新型越狱攻击分类体系。随后在9个已对齐的大型语言模型上，针对16类违规场景中的160个禁忌问题开展了全面测量与消融实验。同时测试了8种先进防御机制下的攻击有效性。基于分类体系与实验结果，我们发现了若干重要规律，例如启发式攻击虽能实现较高成功率，但因易被防御机制阻断而导致实用性低下。本研究为越狱攻防领域的后续研究提供了重要启示，希望有助于学界避免重复性工作，并为从业者提供有效的基准测试工具。

Model Extrapolation Expedites Alignment

Abstract

arXiv:2404.16792v4 Announce Type: replace-cross Abstract: Given the high computational cost of preference alignment training of large language models (LLMs), exploring efficient methods to reduce the training overhead remains an important and compelling research problem. Motivated by the observation that alignment training typically involves only small parameter changes without injecting new knowledge into models, we propose a straightforward method called ExPO (model extrapolation) to expedite LLMs' alignment with human preferences. Given a partially-trained model and its initial SFT checkpoint, ExPO improves the implicit optimization objective of alignment training by simply amplifying the parameter change based on a first-order approximation, without any additional training overhead. Through controlled experiments, we demonstrate that ExPO boosts a DPO model trained with only 20% steps to outperform the fully-trained one. Moreover, we show that ExPO notably improves existing open-source LLMs (ranging from 1.8B to 70B parameters) on the leading AlpacaEval 2.0 and MT-Bench benchmarks, which highlights ExPO's broader utility in efficiently enhancing LLM alignment.

摘要

鉴于大型语言模型（LLMs）偏好对齐训练的高计算成本，探索降低训练开销的高效方法仍是一个重要且具有吸引力的研究课题。通过观察发现对齐训练通常仅涉及微小参数变化而不会向模型注入新知识，我们提出了一种名为ExPO（模型外推法）的简洁方法，以加速LLMs与人类偏好的对齐。给定部分训练的模型及其初始监督微调检查点，ExPO基于一阶近似直接放大参数变化来改进对齐训练的隐式优化目标，无需任何额外训练开销。控制实验表明，ExPO能使仅训练20%步数的DPO模型超越完整训练的模型。此外，我们在领先的AlpacaEval 2.0和MT-Bench基准测试中证实，ExPO显著提升了现有开源LLMs（参数规模从18亿到700亿不等）的性能，这凸显了ExPO在高效增强LLM对齐方面的广泛实用性。
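
The ExPO abstract describes amplifying the parameter change between the initial SFT checkpoint and a partially trained model. A minimal sketch of that idea follows, assuming the update takes the form aligned + alpha * (aligned - sft); the function name `extrapolate`, the coefficient `alpha`, and the dict-of-lists parameter layout are illustrative assumptions, not the authors' code.

```python
def extrapolate(sft_params, aligned_params, alpha):
    """Amplify the SFT -> aligned parameter change by a factor alpha.

    Parameters are plain dicts mapping a name to a list of floats
    (a stand-in for real weight tensors). alpha is a hypothetical
    extrapolation coefficient; alpha = 0 returns the aligned model unchanged.
    """
    if sft_params.keys() != aligned_params.keys():
        raise ValueError("checkpoints must share the same parameter names")
    result = {}
    for name, aligned in aligned_params.items():
        sft = sft_params[name]
        if len(sft) != len(aligned):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # new = aligned + alpha * (aligned - sft), applied element-wise
        result[name] = [a + alpha * (a - s) for s, a in zip(sft, aligned)]
    return result


sft = {"w": [1.0, 2.0]}
aligned = {"w": [1.5, 1.0]}
extrapolated = extrapolate(sft, aligned, alpha=0.5)
```

No gradient step is involved, which matches the abstract's claim of no additional training overhead; how alpha should be chosen is not stated there.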

Parrot: Multilingual Visual Instruction Tuning

Abstract

arXiv:2406.02539v3 Announce Type: replace-cross Abstract: The rapid development of Multimodal Large Language Models (MLLMs), such as GPT-4o, marks a significant step toward artificial general intelligence. Existing methods typically align vision encoders with LLMs via supervised fine-tuning (SFT), but this often deteriorates their ability to handle multiple languages as training progresses. We empirically observe that imbalanced SFT datasets, largely English-centric, degrade performance on non-English languages due to the failure in multilingual token alignment. To address this, we propose PARROT, a novel approach that leverages textual guidance for visual token alignment at the language level. PARROT conditions visual tokens on diverse language inputs and uses Mixture-of-Experts (MoE) to align multilingual tokens. By computing cross-attention between initial visual features and textual embeddings, we select the most relevant experts, converting visual tokens into language-specific representations. Additionally, we introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions, to assess multilingual capabilities. PARROT achieves state-of-the-art performance on both the multilingual benchmarks and a wide range of multimodal tasks. Code and dataset are available at: https://github.com/AIDC-AI/Parrot

摘要

多模态大语言模型（如GPT-4o）的快速发展标志着通向通用人工智能的重要一步。现有方法通常通过监督微调（SFT）将视觉编码器与大语言模型对齐，但这往往导致模型在多语言处理能力上的退化。我们通过实证研究发现，以英语为主的失衡SFT数据集会因多语言词元对齐失败而降低非英语语言的性能。为此，我们提出PARROT方法——一种利用文本指导实现语言层级视觉词元对齐的新方案。PARROT通过混合专家（MoE）机制将视觉词元与多语言输入条件化关联，通过计算初始视觉特征与文本嵌入的交叉注意力来筛选最相关的专家，从而将视觉词元转换为语言特异性表征。此外，我们构建了包含6种语言、15个类别和12,000个问题的大规模多语言多模态基准（MMMB）用于评估多语言能力。实验表明，PARROT在多语言基准和广泛的多模态任务中均达到最先进性能。代码与数据集详见：https://github.com/AIDC-AI/Parrot

Query Performance Prediction using Relevance Judgments Generated by Large Language Models

Abstract

arXiv:2404.01012v3 Announce Type: replace-cross Abstract: Query performance prediction (QPP) aims to estimate the retrieval quality of a search system for a query without human relevance judgments. Previous QPP methods typically return a single scalar value and do not require the predicted values to approximate a specific information retrieval (IR) evaluation measure, leading to certain drawbacks: (i) a single scalar is insufficient to accurately represent different IR evaluation measures, especially when metrics do not highly correlate, and (ii) a single scalar limits the interpretability of QPP methods because solely using a scalar is insufficient to explain QPP results. To address these issues, we propose a QPP framework using automatically generated relevance judgments (QPP-GenRE), which decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query. This allows us to predict any IR evaluation measure using the generated relevance judgments as pseudo-labels. This also allows us to interpret predicted IR evaluation measures, and identify, track and rectify errors in generated relevance judgments to improve QPP quality. We predict an item's relevance by using open-source large language models (LLMs) to ensure scientific reproducibility. We face two main challenges: (i) excessive computational costs of judging an entire corpus for predicting a metric considering recall, and (ii) limited performance in prompting open-source LLMs in a zero-/few-shot manner. To solve the challenges, we devise an approximation strategy to predict an IR measure considering recall and propose to fine-tune open-source LLMs using human-labeled relevance judgments. Experiments on the TREC 2019 to 2022 deep learning tracks and CAsT-19 and 20 datasets show that QPP-GenRE achieves state-of-the-art QPP quality for both lexical and neural rankers.

摘要

查询性能预测（QPP）旨在无需人工相关性判断的情况下，评估搜索系统对查询的检索质量。现有QPP方法通常返回单一标量值，且不要求预测值逼近特定信息检索（IR）评估指标，这导致两个缺陷：（i）单一标量无法准确表征不同IR评估指标，尤其在指标间相关性较低时；（ii）单一标量限制了QPP方法的可解释性，仅凭标量难以阐明预测结果。为解决这些问题，我们提出基于自动生成相关性判断的QPP框架（QPP-GenRE），将QPP分解为预测排序列表中每项内容与查询相关性的独立子任务。该框架允许我们使用生成的相关性判断作为伪标签来预测任意IR评估指标，并能解释预测结果、识别追踪生成相关性判断中的错误以提升QPP质量。我们采用开源大语言模型（LLMs）预测项目相关性以确保科学可复现性。面临两大挑战：（i）为预测涉及召回率的指标需评估整个语料库导致的过高计算成本；（ii）零样本/少样本提示开源LLMs时性能受限。对此，我们设计了面向召回率指标的近似预测策略，并提出基于人工标注相关性判断的LLMs微调方法。在TREC 2019-2022深度学习赛道及CAsT-19/20数据集上的实验表明，QPP-GenRE在词汇和神经排序器上均实现了当前最优的QPP质量。
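
The central idea in the QPP-GenRE abstract is that once each item in a ranked list has a predicted relevance label, any IR measure can be computed from those pseudo-labels. The sketch below illustrates this with two common measures, assuming binary labels; the function names and the example list are hypothetical, and the LLM that would produce the labels is left out.

```python
def precision_at_k(predicted_relevance, k):
    """Precision@k computed from predicted (pseudo) relevance labels.

    predicted_relevance: list of 0/1 labels in ranked order.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(predicted_relevance[:k]) / k


def reciprocal_rank(predicted_relevance):
    """1 / rank of the first item predicted relevant, or 0.0 if none."""
    for rank, label in enumerate(predicted_relevance, start=1):
        if label:
            return 1.0 / rank
    return 0.0


# Hypothetical judgments for one query's top-5 ranked list.
judgments = [0, 1, 1, 0, 1]
p_at_5 = precision_at_k(judgments, 5)
rr = reciprocal_rank(judgments)
```

Because the per-item labels are kept, a surprising predicted score can be traced back to the individual judgments that produced it, which is the interpretability benefit the abstract mentions.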

Chain-of-Scrutiny: Detecting Backdoor Attacks for Large Language Models

Abstract

arXiv:2406.05948v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs), especially those accessed via APIs, have demonstrated impressive capabilities across various domains. However, users without technical expertise often turn to (untrustworthy) third-party services, such as prompt engineering, to enhance their LLM experience, creating vulnerabilities to adversarial threats like backdoor attacks. Backdoor-compromised LLMs generate malicious outputs to users when inputs contain specific "triggers" set by attackers. Traditional defense strategies, originally designed for small-scale models, are impractical for API-accessible LLMs due to limited model access, high computational costs, and data requirements. To address these limitations, we propose Chain-of-Scrutiny (CoS) which leverages LLMs' unique reasoning abilities to mitigate backdoor attacks. It guides the LLM to generate reasoning steps for a given input and scrutinizes for consistency with the final output -- any inconsistencies indicating a potential attack. It is well-suited for the popular API-only LLM deployments, enabling detection at minimal cost and with little data. User-friendly and driven by natural language, it allows non-experts to perform the defense independently while maintaining transparency. We validate the effectiveness of CoS through extensive experiments on various tasks and LLMs, with results showing greater benefits for more powerful LLMs.

摘要

大型语言模型（LLMs），尤其是通过API访问的模型，已在多个领域展现出卓越能力。然而，缺乏技术专业知识的用户常依赖（不可信的）第三方服务（如提示工程）来增强LLM使用体验，这使其易受后门攻击等对抗性威胁。后门植入的LLMs在输入包含攻击者预设的特定"触发器"时，会向用户生成恶意输出。传统防御策略最初为小规模模型设计，由于模型访问受限、计算成本高昂及数据需求等问题，难以适用于API访问的LLMs。为应对这些局限，我们提出"链式审查"（Chain-of-Scrutiny，CoS）方法，利用LLMs独特的推理能力来缓解后门攻击。该方法引导LLM为给定输入生成推理步骤，并审查其与最终输出的一致性——任何不一致均提示潜在攻击。该方案特别适合当前主流的纯API部署模式，能以极低成本和少量数据实现检测。其自然语言驱动的用户友好特性，使得非专家用户可独立执行防御，同时保持透明度。我们通过多任务和多LLMs的广泛实验验证了CoS的有效性，结果表明其对性能更强的LLMs具有更显著优势。

Language Models Benefit from Preparation with Elicited Knowledge

Abstract

arXiv:2409.01345v4 Announce Type: replace-cross Abstract: The zero-shot chain of thought (CoT) approach is often used in question answering (QA) by language models (LMs) for tasks that require multiple reasoning steps. However, some QA tasks hinge more on accessing relevant knowledge than on chaining reasoning steps. We introduce a simple prompting technique, called PREP, that involves using two instances of LMs: the first (LM1) generates relevant information, and the second (LM2) receives the information from the user and answers the question. This design is intended to make better use of the LM's instruction-following capability. PREP is applicable across various QA tasks without domain-specific prompt engineering. PREP is developed on a dataset of 100 QA questions, derived from an extensive schematic dataset specifying artifact parts and material composition. These questions ask which of two artifacts is less likely to share materials with another artifact. Such questions probe the LM's knowledge of shared materials in the part structure of different artifacts. We test our method on our parts-and-materials dataset and three published commonsense reasoning datasets. The average accuracy of our method is consistently higher than that of all the other tested methods across all the tested datasets.

摘要

零样本思维链（CoT）方法常被语言模型（LM）用于需要多步推理的问答（QA）任务。然而，某些QA任务更依赖于获取相关知识而非串联推理步骤。我们提出一种名为PREP的简单提示技术，其采用两个LM实例：首实例（LM1）生成相关信息，次实例（LM2）接收用户信息并回答问题。该设计旨在更好地利用LM的指令遵循能力。PREP适用于各类QA任务，无需领域特定的提示工程。该方法基于100个QA问题的数据集开发，这些问题源自详尽的构件部件与材料组成的图式数据集。这些问题询问两个构件中哪一个较不可能与另一构件共享材料，旨在探究LM对不同构件部件结构中材料共享知识的理解。我们在自建的部件-材料数据集及三个已发表的常识推理数据集上测试本方法。在所有测试数据集中，本方法的平均准确率始终高于其他所有对比方法。

USDC: A Dataset of User Stance and Dogmatism in Long Conversations

Abstract

arXiv:2406.16833v2 Announce Type: replace-cross Abstract: Analyzing user opinion changes in long conversation threads is extremely critical for applications like enhanced personalization, market research, political campaigns, customer service, targeted advertising, and content moderation. Unfortunately, previous studies on stance and dogmatism in user conversations have focused on training models using datasets annotated at the post level, treating each post as independent and randomly sampling posts from conversation threads. Hence, first, we build a dataset for studying user opinion fluctuations in 764 long multi-user Reddit conversation threads, called USDC. USDC contains annotations for 2 tasks: i) User Stance classification, which involves labeling a user's stance in a post within a conversation on a five-point scale; ii) User Dogmatism classification, which involves labeling a user's overall opinion in the conversation on a four-point scale. Besides being time-consuming and costly, manual annotations for USDC are challenging because: 1) Conversation threads could be very long, increasing the chances of noisy annotations; and 2) Interpreting instances where a user changes their opinion within a conversation is difficult because often such transitions are subtle and not expressed explicitly. Hence, we leverage majority voting on zero-shot, one-shot, and few-shot annotations from Mistral Large and GPT-4 to automate the annotation process. Human annotations on 200 test conversations achieved inter-annotator agreement scores of 0.49 for stance and 0.50 for dogmatism with these LLM annotations, indicating a reasonable level of consistency between human and LLM annotations. USDC is then used to finetune and instruction-tune multiple deployable small language models like LLaMA, Falcon and Vicuna for the stance and dogmatism classification tasks. We make the code and dataset publicly available [https://github.com/mounikamarreddy/USDC].

摘要

分析长对话线程中用户观点的变化对于增强个性化服务、市场调研、政治竞选、客户服务、定向广告和内容审核等应用至关重要。然而，以往关于用户对话中立场和固执程度的研究主要集中于使用帖子级标注数据集训练模型，将每个帖子视为独立样本并从对话线程中随机抽取。为此，我们首先构建了一个名为USDC的数据集，用于研究764条Reddit多用户长对话线程中的用户观点波动。USDC包含两项任务的标注：i) 用户立场分类，即以五级量表标注用户在对话中某帖子的立场；ii) 用户固执程度分类，即以四级量表标注用户在整段对话中的总体观点倾向。除耗时昂贵外，USDC的人工标注还面临两大挑战：1) 对话线程可能极长，导致标注噪声增加；2) 用户观点转变的实例难以判定，因其往往呈现微妙且非显性的表达。为此，我们采用Mistral Large和GPT-4的零样本、单样本和少样本标注结果进行多数表决，实现标注流程自动化。在200条测试对话中，人工标注与LLM标注的评分者间一致性得分分别为立场0.49、固执程度0.50，表明两者具有合理的一致性。基于USDC，我们对LLaMA、Falcon和Vicuna等多个可部署小语言模型进行了微调和指令调优，以完成立场与固执程度分类任务。代码和数据集已公开[https://github.com/mounikamarreddy/USDC]。
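
The USDC abstract describes majority voting over zero-shot, one-shot, and few-shot annotations from two LLMs. A minimal sketch of such a vote follows; the label strings are placeholders, and the tie-breaking rule (the earliest label among those tied wins) is an assumption, since the abstract does not say how ties are resolved.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the earliest tied label.

    labels: non-empty list of annotations for one item, e.g. one per
    (model, prompting setting) pair.
    """
    if not labels:
        raise ValueError("need at least one annotation")
    counts = Counter(labels)
    top = max(counts.values())
    for label in labels:  # scan in input order so tie-breaking is deterministic
        if counts[label] == top:
            return label


# Six hypothetical annotations: 2 models x 3 prompting settings.
votes = ["favor", "against", "favor", "favor", "neutral", "against"]
consensus = majority_vote(votes)
```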

RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models

Abstract

arXiv:2408.14744v4 Announce Type: replace-cross Abstract: Abundant, well-annotated multimodal data in remote sensing are pivotal for aligning complex visual remote sensing (RS) scenes with human language, enabling the development of specialized vision language models across diverse RS interpretation tasks. However, annotating RS images with rich linguistic semantics at scale demands expertise in RS and substantial human labor, making it costly and often impractical. In this study, we propose a workflow that leverages large language models (LLMs) to generate multimodal datasets with semantically rich captions at scale from plain OpenStreetMap (OSM) data for images sourced from the Google Earth Engine (GEE) platform. This approach facilitates the generation of paired remote sensing data and can be readily scaled up using openly available data. Within this framework, we present RSTeller, a multimodal dataset comprising over 1.3 million RS images, each accompanied by two descriptive captions. Extensive experiments demonstrate that RSTeller enhances the performance of multiple existing vision language models for RS scene understanding through continual pre-training. Our methodology significantly reduces the manual effort and expertise needed for annotating remote sensing imagery while democratizing access to high-quality annotated data. This advancement fosters progress in visual language modeling and encourages broader participation in remote sensing research and applications. The RSTeller dataset is available at https://github.com/SlytherinGe/RSTeller.

摘要

丰富且标注完善的多模态遥感数据对于将复杂的视觉遥感场景与人类语言对齐至关重要，能够推动针对多样化遥感解译任务的专用视觉语言模型开发。然而，大规模为遥感图像标注富含语言语义的信息需要遥感领域专业知识及大量人力，导致成本高昂且往往难以实现。本研究提出一种创新工作流程，利用大语言模型（LLMs）基于开源OpenStreetMap（OSM）数据，为来自Google Earth Engine（GEE）平台的图像批量生成具有丰富语义描述的多模态数据集。该方法可高效生成配对的遥感数据，并能通过公开可用数据轻松扩展。在此框架下，我们发布RSTeller多模态数据集，包含超过130万幅遥感图像，每幅图像均配有两条描述性文本标注。大量实验表明，通过持续预训练，RSTeller能显著提升多种现有视觉语言模型在遥感场景理解任务中的性能。本方法大幅降低了遥感图像标注所需的人工成本与专业知识门槛，同时促进了高质量标注数据的开放获取。这一进展推动了视觉语言建模的发展，并鼓励更广泛群体参与遥感研究与应用。RSTeller数据集发布于https://github.com/SlytherinGe/RSTeller。

PII-Scope: A Comprehensive Study on Training Data PII Extraction Attacks in LLMs

Abstract

arXiv:2410.06704v2 Announce Type: replace-cross Abstract: In this work, we introduce PII-Scope, a comprehensive benchmark designed to evaluate state-of-the-art methodologies for PII extraction attacks targeting LLMs across diverse threat settings. Our study provides a deeper understanding of these attacks by uncovering several hyperparameters (e.g., demonstration selection) crucial to their effectiveness. Building on this understanding, we extend our study to more realistic attack scenarios, exploring PII attacks that employ advanced adversarial strategies, including repeated and diverse querying, and leveraging iterative learning for continual PII extraction. Through extensive experimentation, our results reveal a notable underestimation of PII leakage in existing single-query attacks. In fact, we show that with sophisticated adversarial capabilities and a limited query budget, PII extraction rates can increase by up to fivefold when targeting the pretrained model. Moreover, we evaluate PII leakage on finetuned models, showing that they are more vulnerable to leakage than pretrained models. Overall, our work establishes a rigorous empirical benchmark for PII extraction attacks in realistic threat scenarios and provides a strong foundation for developing effective mitigation strategies.

摘要

在本研究中，我们提出了PII-Scope这一综合性基准，旨在评估针对大语言模型（LLMs）的个人身份信息（PII）提取攻击在不同威胁场景下的前沿方法。通过揭示若干关键超参数（如示例选择）对攻击效力的影响，我们的研究深化了对这类攻击的理解。基于此发现，我们将研究拓展至更现实的攻击场景，探索采用高级对抗策略（包括重复多样化查询和迭代学习持续提取）的PII攻击。大量实验结果表明，现有单次查询攻击严重低估了PII泄露风险。事实上，当攻击者具备复杂对抗能力且查询预算有限时，针对预训练模型的PII提取率最高可提升五倍。此外，我们对微调模型的评估显示，其PII泄露脆弱性显著高于预训练模型。本研究为现实威胁场景中的PII提取攻击建立了严谨的实证基准，并为制定有效防御策略奠定了坚实基础。

Identifying Knowledge Editing Types in Large Language Models

Abstract

arXiv:2409.19663v3 Announce Type: replace-cross Abstract: Knowledge editing has emerged as an efficient technique for updating the knowledge of large language models (LLMs), attracting increasing attention in recent years. However, there is a lack of effective measures to prevent the malicious misuse of this technique, which could lead to harmful edits in LLMs. These malicious modifications could cause LLMs to generate toxic content, misleading users into inappropriate actions. To address this risk, we introduce a new task, Knowledge Editing Type Identification (KETI), aimed at identifying different types of edits in LLMs, thereby providing timely alerts to users when encountering illicit edits. As part of this task, we propose KETIBench, which includes five types of harmful edits covering the most popular toxic types, as well as one benign factual edit. We develop five classical classification models and three BERT-based models as baseline identifiers for both open-source and closed-source LLMs. Our experimental results, across 92 trials involving four models and three knowledge editing methods, demonstrate that all eight baseline identifiers achieve decent identification performance, highlighting the feasibility of identifying malicious edits in LLMs. Additional analyses reveal that the performance of the identifiers is independent of the reliability of the knowledge editing methods and exhibits cross-domain generalization, enabling the identification of edits from unknown sources. All data and code are available at https://github.com/xpq-tech/KETI.

摘要

知识编辑作为一种更新大型语言模型（LLMs）知识的高效技术，近年来受到越来越多的关注。然而，目前缺乏有效手段防止该技术的恶意滥用，这可能导致对LLMs进行有害编辑。此类恶意修改会使LLMs生成有毒内容，误导用户采取不当行为。针对这一风险，我们提出新任务——知识编辑类型识别（KETI），旨在识别LLMs中不同类型的编辑，从而在遭遇非法编辑时为用户提供及时预警。作为该任务组成部分，我们构建了KETIBench基准数据集，涵盖五种最流行的有害编辑类型及一类良性事实编辑。我们开发了五种经典分类模型和三种基于BERT的模型作为基线识别器，适用于开源和闭源LLMs。通过在四个模型和三种知识编辑方法上进行的92组实验，结果表明所有八种基线识别器均取得良好识别性能，证实了识别LLMs恶意编辑的可行性。进一步分析表明，识别器性能与知识编辑方法的可靠性无关，且具有跨领域泛化能力，可识别未知来源的编辑。所有数据与代码已开源：https://github.com/xpq-tech/KETI。

Policy Filtration for RLHF to Mitigate Noise in Reward Models

Abstract

arXiv:2409.06957v3 Announce Type: replace-cross Abstract: While direct policy optimization methods exist, pioneering LLMs are fine-tuned with reinforcement learning from human feedback (RLHF) to generate better responses under the supervision of a reward model learned from preference data. One major challenge of RLHF is the inaccuracy of the intermediate reward model, especially in tasks that require complex reasoning for the reward model to score a response. We find that the reliability of the reward model varies across responses assigned different rewards. This motivates us to filter the samples whose rewards may be unreliable to improve the signal-to-noise ratio during policy learning, resulting in Policy Filtration for Proximal Policy Optimization (PF-PPO). To choose a proper policy filtering strategy, we use the coefficient of determination (R²) between the rewards and actual scores on filtered samples as the metric to help us find promising strategies, since it measures how well the rewards filtered by PF-PPO indicate real performance. We provide extensive experiments to validate the effectiveness of PF-PPO in code generation and math reasoning tasks. In code generation, PF-PPO achieves the state-of-the-art performance of 7-billion-parameter models on HumanEval (+7.9%), MBPP (+0.7%), and LeetCode Contest (+10.0%), which is a more challenging benchmark created by us. In math reasoning, PF-PPO yields performance increases using different reward models and benchmarks (Ape210K and CMATH). Code is available at https://github.com/swtheing/PF-PPO-RLHF.

摘要

虽然存在直接策略优化方法，但前沿的大语言模型（LLM）通常通过人类反馈强化学习（RLHF）进行微调，以在基于偏好数据学习的奖励模型监督下生成更优响应。RLHF面临的主要挑战在于中间奖励模型的不准确性，尤其是当任务需要复杂推理才能使奖励模型对响应进行评分时。我们发现奖励模型的可靠性会因响应所获奖励值的不同而存在差异，这促使我们通过过滤奖励可能不可靠的样本来提升策略学习过程中的信噪比，从而提出了近端策略优化的策略过滤方法（PF-PPO）。为选择合适的策略过滤方案，我们采用决定系数（R²）作为衡量指标——该系数能反映经PF-PPO过滤后的奖励与实际得分的相关性程度，从而帮助我们筛选有效策略。通过大量实验，我们验证了PF-PPO在代码生成和数学推理任务中的有效性。在代码生成方面，PF-PPO使70亿参数模型在HumanEval（+7.9%）、MBPP（+0.7%）以及我们创建的更具挑战性的LeetCode Contest（+10.0%）基准测试中达到当前最优性能。在数学推理任务中，PF-PPO在不同奖励模型和基准测试（Ape210K与CMATH）上均实现了性能提升。代码已开源于https://github.com/swtheing/PF-PPO-RLHF。
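The strategy-selection step described above can be illustrated with a small sketch. It assumes that R² is the squared Pearson correlation between rewards and actual scores (the abstract does not say how R² is computed), and the data, the `drop_middle` filter, and its thresholds are hypothetical, invented for illustration only.

```python
from statistics import mean


def r_squared(rewards, scores):
    """R^2 of a simple linear fit of scores on rewards (squared Pearson correlation)."""
    mr, ms = mean(rewards), mean(scores)
    cov = sum((r - mr) * (s - ms) for r, s in zip(rewards, scores))
    var_r = sum((r - mr) ** 2 for r in rewards)
    var_s = sum((s - ms) ** 2 for s in scores)
    if var_r == 0 or var_s == 0:
        return 0.0  # degenerate case: no variation to explain
    return cov * cov / (var_r * var_s)


# Synthetic (reward, actual score) pairs, invented for illustration only.
# Mid-range rewards are deliberately noisy predictors of the actual score.
SAMPLES = [
    (0.10, 0.0), (0.20, 0.0), (0.40, 1.0), (0.50, 0.0),
    (0.60, 1.0), (0.55, 0.0), (0.90, 1.0), (0.95, 1.0),
]


def keep_all(samples):
    """Baseline strategy: no filtering."""
    return list(samples)


def drop_middle(samples, low=0.3, high=0.8):
    """Hypothetical strategy: keep only samples with clearly low or high reward."""
    return [(r, s) for r, s in samples if r <= low or r >= high]


def score_strategy(strategy, samples):
    """Evaluate a filtering strategy by the R^2 of the samples it keeps."""
    kept = strategy(samples)
    return r_squared([r for r, _ in kept], [s for _, s in kept])


for strategy in (keep_all, drop_middle):
    print(strategy.__name__, round(score_strategy(strategy, SAMPLES), 3))
```

On this toy data the filtered subset has a higher R² than the full set, which is the kind of signal the abstract describes using to compare candidate strategies. The actual PF-PPO filters and scoring setup may differ.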

Stuffed Mamba: Oversized States Lead to the Inability to Forget

Abstract

arXiv:2410.07145v2 Announce Type: replace-cross Abstract: Recent advancements in recurrent architectures, such as Mamba and RWKV, have showcased strong language capabilities. Unlike transformer-based models, these architectures encode all contextual information into a fixed-size state, leading to great inference efficiency. However, this approach can cause information interference, where different token data conflicts, resulting in performance degradation and incoherent outputs beyond a certain context length. To prevent this, most RNNs incorporate mechanisms designed to "forget" earlier tokens. In this paper, we reveal that Mamba-based models struggle to effectively forget earlier tokens even with built-in forgetting mechanisms. We demonstrate that this issue stems from training on contexts that are too short for the state size, enabling the model to perform well without needing to learn how to forget. Then, we show that the minimum training length required for the model to learn forgetting scales linearly with the state size, and the maximum context length for accurate retrieval of a 5-digit passkey scales exponentially with the state size, indicating that the model retains some information beyond the point where forgetting begins. These findings highlight a critical limitation in current RNN architectures and provide valuable insights for improving long-context modeling. Our work suggests that future RNN designs must account for the interplay between state size, training length, and forgetting mechanisms to achieve robust performance in long-context tasks.

摘要

最近，以Mamba和RWKV为代表的循环架构在语言任务中展现出强大性能。与基于Transformer的模型不同，这些架构将所有上下文信息编码为固定大小的状态，从而实现了高效的推理能力。然而，这种方法可能导致信息干扰——不同标记数据相互冲突，当上下文长度超过特定阈值时，会造成性能下降和输出不连贯。为防止这一问题，大多数循环神经网络都设计了"遗忘"早期标记的机制。本文揭示：即使内置遗忘机制，基于Mamba的模型仍难以有效遗忘早期标记。我们证明该问题源于训练上下文长度相对于状态规模过短，使得模型无需学习遗忘机制即可表现良好。进一步研究表明，模型学习遗忘所需的最小训练长度与状态规模呈线性关系，而准确检索5位数密码的最大上下文长度与状态规模呈指数关系，这表明模型在开始遗忘后仍保留部分信息。这些发现揭示了当前循环架构的关键局限，为改进长上下文建模提供了重要见解。我们的研究表明，未来循环神经网络设计必须综合考虑状态规模、训练长度与遗忘机制的相互作用，才能在长上下文任务中实现稳健性能。

Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up

Abstract

arXiv:2410.12323v3 Announce Type: replace-cross Abstract: Large language models (LLMs) have shown remarkable performance in reasoning tasks but face limitations in mathematical and complex logical reasoning. Existing methods to improve LLMs' logical capabilities either involve traceable or verifiable logical sequences that generate more reliable responses by constructing logical structures yet increase computational costs, or introduce rigid logic template rules, reducing flexibility. In this paper, we propose Reversal of Thought (RoT), a plug-and-play and cost-effective reasoning framework designed to enhance the logical reasoning abilities of LLMs during the warm-up phase prior to batch inference. RoT utilizes a Preference-Guided Reverse Reasoning warm-up strategy, which integrates logical symbols for pseudocode planning through meta-cognitive mechanisms and pairwise preference self-evaluation to generate task-specific prompts solely through demonstrations, aligning with LLMs' cognitive preferences shaped by RLHF. Through reverse reasoning, we utilize a Cognitive Preference Manager to assess knowledge boundaries and further expand LLMs' reasoning capabilities by aggregating solution logic for known tasks and stylistic templates for unknown tasks. Experiments across various tasks demonstrate that RoT surpasses existing baselines in both reasoning accuracy and efficiency.

摘要

大型语言模型（LLMs）在推理任务中展现出卓越性能，但在数学和复杂逻辑推理方面仍存在局限。现有提升LLMs逻辑能力的方法可分为两类：一类通过构建可追踪或可验证的逻辑序列生成更可靠响应，但会增加计算成本；另一类引入刚性逻辑模板规则，降低了灵活性。本文提出"逆向思维"（RoT）框架，这是一种即插即用、高性价比的推理方案，旨在批量推理前的预热阶段增强LLMs的逻辑推理能力。RoT采用"偏好引导逆向推理"预热策略，通过元认知机制整合逻辑符号进行伪代码规划，并利用成对偏好自评估仅通过示例生成任务特定提示，从而契合LLMs经RLHF塑造的认知偏好。借助逆向推理，我们使用"认知偏好管理器"评估知识边界：对已知任务聚合解决方案逻辑，对未知任务整合风格模板，从而扩展LLMs的推理能力。多项任务实验表明，RoT在推理准确性和效率上均超越现有基线方法。

SynapticRAG: Enhancing Temporal Memory Retrieval in Large Language Models through Synaptic Mechanisms

Abstract

arXiv:2410.13553v2 Announce Type: replace-cross Abstract: Existing retrieval methods in Large Language Models show degradation in accuracy when handling temporally distributed conversations, primarily due to their reliance on simple similarity-based retrieval. Unlike existing memory retrieval methods that rely solely on semantic similarity, we propose SynapticRAG, which uniquely combines temporal association triggers with biologically-inspired synaptic propagation mechanisms. Our approach uses temporal association triggers and synaptic-like stimulus propagation to identify relevant dialogue histories. A dynamic leaky integrate-and-fire mechanism then selects the most contextually appropriate memories. Experiments on four datasets in English, Chinese, and Japanese show that, compared to state-of-the-art memory retrieval methods, SynapticRAG achieves consistent improvements across multiple metrics of up to 14.66 percentage points. This work bridges the gap between cognitive science and language model development, providing a new framework for memory management in conversational systems.

摘要

现有大型语言模型中的检索方法在处理时间分布型对话时存在准确性下降的问题，这主要源于其对简单相似性检索的依赖。与仅依赖语义相似度的传统记忆检索方法不同，我们提出SynapticRAG，该方法创新性地将时间关联触发器与受生物启发的突触传播机制相结合。我们的方法通过时间关联触发器和类突触刺激传播来识别相关对话历史，随后采用动态漏积分发放机制选择上下文最适配的记忆。在英语、汉语和日语四个数据集上的实验表明，相较于最先进的记忆检索方法，SynapticRAG在多项指标上实现了最高达14.66个百分点的稳定提升。该研究弥合了认知科学与语言模型开发之间的鸿沟，为对话系统中的记忆管理提供了新框架。
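For readers unfamiliar with the leaky integrate-and-fire idea mentioned above, the following is a minimal sketch of the generic textbook model in discrete time, not SynapticRAG's dynamic variant. The function name `lif_fires`, the threshold, the leak factor, and the stimulus values are all assumptions chosen for illustration.

```python
def lif_fires(stimuli, threshold=1.0, leak=0.5):
    """Return True if a leaky integrate-and-fire unit reaches threshold.

    At each step the stored potential decays by `leak`, then the new
    stimulus is added. The unit "fires" once the potential reaches
    `threshold`.
    """
    potential = 0.0
    for stimulus in stimuli:
        potential = leak * potential + stimulus
        if potential >= threshold:
            return True
    return False


# Hypothetical stimulus traces for two stored memories.
clustered = [0.6, 0.6, 0.6]        # stimuli arrive close together in time
spread_out = [0.6, 0.0, 0.0, 0.6]  # same stimulus strength, separated by gaps

print("clustered fires:", lif_fires(clustered))
print("spread out fires:", lif_fires(spread_out))
```

The point of the sketch is that stimulation close together in time accumulates past the threshold, while the same stimulation spread out leaks away. This is one plausible reading of why such a mechanism suits temporally distributed conversations.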

Conformity in Large Language Models

Abstract

arXiv:2410.12428v2 Announce Type: replace-cross Abstract: The conformity effect describes the tendency of individuals to align their responses with the majority. Studying this bias in large language models (LLMs) is crucial, as LLMs are increasingly used in various information-seeking and decision-making tasks as conversation partners to improve productivity. Thus, conformity to incorrect responses can compromise their effectiveness. In this paper, we adapt psychological experiments to examine the extent of conformity in popular LLMs. Our findings reveal that all tested models exhibit varying levels of conformity toward the majority, regardless of their initial choice or correctness, across different knowledge domains. Notably, we are the first to show that LLMs are more likely to conform when they are more uncertain in their own prediction. We further explore factors that influence conformity, such as training paradigms and input characteristics, finding that instruction-tuned models are less susceptible to conformity, while increasing the naturalness of majority tones amplifies conformity. Finally, we propose two interventions, Devil's Advocate and Question Distillation, to mitigate conformity, providing insights into building more robust language models.

摘要

从众效应描述了个人倾向于使其反应与多数人保持一致的现象。研究大型语言模型（LLMs）中的这种偏差至关重要，因为LLMs正日益作为对话伙伴被用于各种信息检索和决策任务以提高生产力。因此，对错误回答的从众行为会损害其有效性。本文通过改编心理学实验来检验主流LLMs的从众程度。研究发现，所有测试模型在不同知识领域均表现出不同程度的从众倾向，无论其初始选择或答案正确与否。值得注意的是，我们首次证明当LLMs对自身预测更不确定时，其从众可能性更高。我们进一步探究了影响从众的因素，如训练范式和输入特征，发现经过指令微调的模型较不易从众，而提高多数意见表述的自然度则会增强从众行为。最后，我们提出"魔鬼代言人"和"问题蒸馏"两种干预措施来缓解从众效应，为构建更稳健的语言模型提供了见解。

LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts

Abstract

arXiv:2410.10700v2 Announce Type: replace-cross Abstract: Safety concerns in large language models (LLMs) have gained significant attention due to their exposure to potentially harmful data during pre-training. In this paper, we identify a new safety vulnerability in LLMs: their susceptibility to natural distribution shifts between attack prompts and original toxic prompts, where seemingly benign prompts, semantically related to harmful content, can bypass safety mechanisms. To explore this issue, we introduce a novel attack method, ActorBreaker, which identifies actors related to toxic prompts within pre-training distribution to craft multi-turn prompts that gradually lead LLMs to reveal unsafe content. ActorBreaker is grounded in Latour's actor-network theory, encompassing both human and non-human actors to capture a broader range of vulnerabilities. Our experimental results demonstrate that ActorBreaker outperforms existing attack methods in terms of diversity, effectiveness, and efficiency across aligned LLMs. To address this vulnerability, we propose expanding safety training to cover a broader semantic space of toxic content. We thus construct a multi-turn safety dataset using ActorBreaker. Fine-tuning models on our dataset shows significant improvements in robustness, though with some trade-offs in utility. Code is available at https://github.com/AI45Lab/ActorAttack.

摘要

大型语言模型（LLMs）的安全问题因其在预训练阶段可能接触有害数据而受到广泛关注。本文揭示了一种新的安全漏洞：LLMs易受攻击提示与原始有害提示之间自然分布偏移的影响，即语义上与有害内容相关但表面无害的提示可能绕过安全机制。为探究该问题，我们提出新型攻击方法ActorBreaker，该方法通过识别预训练分布中与有害提示相关的行为体，构建多轮提示逐步诱导LLMs生成不安全内容。ActorBreaker基于拉图尔的行为者网络理论，涵盖人类与非人类行为体以捕捉更广泛的漏洞。实验结果表明，在对齐的LLMs上，ActorBreaker在多样性、有效性和效率方面均优于现有攻击方法。针对此漏洞，我们建议扩展安全训练以覆盖更广泛的有害内容语义空间，并利用ActorBreaker构建了多轮安全数据集。基于该数据集的微调显著提升了模型鲁棒性，但会带来一定的效用损失。代码详见https://github.com/AI45Lab/ActorAttack。

AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels

Abstract

arXiv:2410.20050v2 Announce Type: replace-cross Abstract: Medical information retrieval (MIR) is essential for retrieving relevant medical knowledge from diverse sources, including electronic health records, scientific literature, and medical databases. However, achieving effective zero-shot dense retrieval in the medical domain poses substantial challenges due to the lack of relevance-labeled data. In this paper, we introduce a novel approach called Self-Learning Hypothetical Document Embeddings (SL-HyDE) to tackle this issue. SL-HyDE leverages large language models (LLMs) as generators to generate hypothetical documents based on a given query. These generated documents encapsulate key medical context, guiding a dense retriever in identifying the most relevant documents. The self-learning framework progressively refines both pseudo-document generation and retrieval, utilizing unlabeled medical corpora without requiring any relevance-labeled data. Additionally, we present the Chinese Medical Information Retrieval Benchmark (CMIRB), a comprehensive evaluation framework grounded in real-world medical scenarios, encompassing five tasks and ten datasets. By benchmarking ten models on CMIRB, we establish a rigorous standard for evaluating medical information retrieval systems. Experimental results demonstrate that SL-HyDE significantly surpasses HyDE in retrieval accuracy while showcasing strong generalization and scalability across various LLM and retriever configurations. Our code and data are publicly available at: https://github.com/ll0ruc/AutoMIR

摘要

医学信息检索（MIR）对于从电子健康记录、科学文献和医学数据库等多种来源检索相关医学知识至关重要。然而，由于缺乏相关性标注数据，在医学领域实现有效的零样本密集检索面临重大挑战。本文提出了一种名为自学习假设文档嵌入（SL-HyDE）的新方法来解决这一问题。SL-HyDE利用大型语言模型（LLM）作为生成器，基于给定查询生成假设文档。这些生成的文档囊括了关键医学背景信息，可指导密集检索器识别最相关的文档。该自学习框架逐步优化伪文档生成和检索过程，仅利用未标注的医学语料库，无需任何相关性标注数据。此外，我们提出了中文医学信息检索基准（CMIRB），这是一个基于真实医学场景的综合评估框架，包含五项任务和十个数据集。通过在CMIRB上对十个模型进行基准测试，我们为评估医学信息检索系统建立了严格标准。实验结果表明，SL-HyDE在检索准确率上显著超越HyDE，同时在不同LLM和检索器配置下展现出强大的泛化能力和可扩展性。我们的代码和数据公开于：https://github.com/ll0ruc/AutoMIR

Regress, Don't Guess -- A Regression-like Loss on Number Tokens for Language Models

Abstract

arXiv:2411.02083v2 Announce Type: replace-cross Abstract: While language models have exceptional capabilities at text generation, they lack a natural inductive bias for emitting numbers and thus struggle in tasks involving quantitative reasoning, especially arithmetic. One fundamental limitation is the nature of the Cross Entropy loss, which assumes a nominal scale and thus cannot convey proximity between generated number tokens. In response, we here present a regression-like loss that operates purely on token level. Our proposed Number Token Loss (NTL) comes in two flavors and minimizes either the Lp norm or the Wasserstein distance between the numerical values of the real and predicted number tokens. NTL can easily be added to any language model and extend the Cross Entropy objective during training without runtime overhead. We evaluate the proposed scheme on various mathematical datasets and find that it consistently improves performance in math-related tasks. In a direct comparison on a regression task, we find that NTL can match the performance of a regression head, despite operating on token level. Finally, we scale NTL up to 3B parameter models and observe improved performance, demonstrating its potential for seamless integration into LLMs. We hope that this work can inspire LLM developers to improve their pretraining objectives. The code is available via: https://tum-ai.github.io/number-token-loss/

摘要

虽然语言模型在文本生成方面具有卓越能力，但其缺乏处理数字的自然归纳偏置，因此在涉及定量推理（尤其是算术）的任务中表现欠佳。一个根本性限制在于交叉熵损失函数的本质，该函数假设名义尺度，无法传递生成数字标记之间的邻近关系。为此，我们提出一种纯标记层面的类回归损失函数。我们设计的数字标记损失（NTL）包含两种形式，分别最小化真实数字标记与预测数字标记数值之间的Lp范数或Wasserstein距离。NTL可轻松集成至任何语言模型，在训练过程中扩展交叉熵目标且无需额外运行时开销。我们在多个数学数据集上评估该方案，发现其能持续提升数学相关任务的性能。在回归任务的直接对比中，NTL尽管在标记层面操作，仍能达到回归头的性能水平。最后，我们将NTL扩展至30亿参数模型并观察到性能提升，证明其可无缝集成至大型语言模型的潜力。希望本研究能启发LLM开发者改进预训练目标。代码见：https://tum-ai.github.io/number-token-loss/
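The contrast between cross entropy and a regression-like loss on number tokens can be made concrete with a small sketch. It assumes a vocabulary of ten digit tokens whose numeric values are 0 to 9, takes the softmax-weighted mean as the predicted value for the squared-error variant, and uses the closed form of the Wasserstein-1 distance to a one-hot target. The function names and logits are hypothetical, and the paper's exact formulation may differ.

```python
import math

DIGIT_VALUES = list(range(10))  # assumption: token i carries numeric value i


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Standard cross entropy: only the probability of the target token matters."""
    return -math.log(softmax(logits)[target])


def ntl_squared_error(logits, target):
    """Squared error between the expected predicted value and the target value."""
    probs = softmax(logits)
    expected = sum(p * v for p, v in zip(probs, DIGIT_VALUES))
    return (expected - target) ** 2


def ntl_wasserstein(logits, target):
    """Wasserstein-1 distance between the predicted distribution and a one-hot target.

    With a one-hot target, all probability mass must move to the target
    value, so the distance is the expected absolute deviation.
    """
    probs = softmax(logits)
    return sum(p * abs(v - target) for p, v in zip(probs, DIGIT_VALUES))


def peaked_logits(digit, height=5.0):
    """Hypothetical logits with most of the mass on one digit."""
    return [height if v == digit else 0.0 for v in DIGIT_VALUES]


target = 5
near_miss = peaked_logits(4)  # model prefers 4, which is close to 5
far_miss = peaked_logits(9)   # model prefers 9, which is far from 5

for name, logits in (("near miss", near_miss), ("far miss", far_miss)):
    print(name,
          round(cross_entropy(logits, target), 3),
          round(ntl_squared_error(logits, target), 3),
          round(ntl_wasserstein(logits, target), 3))
```

Cross entropy assigns both predictions the same loss because each puts the same probability on the correct token, whereas both number-aware losses penalize the far miss more heavily. That is the proximity signal the abstract says cross entropy cannot convey.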

Abstract

arXiv:2411.01271v2 Announce Type: replace-cross Abstract: This paper discusses the theory and algorithms for interacting large language model agents (LLMAs) using methods from statistical signal processing and microeconomics. While both fields are mature, their application to decision-making involving interacting LLMAs remains unexplored. Motivated by Bayesian sentiment analysis on online platforms, we construct interpretable models and algorithms that enable LLMAs to interact and perform Bayesian inference. Because interacting LLMAs learn from both prior decisions and external inputs, they can exhibit bias and herding behavior. Thus, developing interpretable models and stochastic control algorithms is essential to understand and mitigate these behaviors. This paper has three main results. First, we show using Bayesian revealed preferences from microeconomics that an individual LLMA satisfies the necessary and sufficient conditions for rationally inattentive (bounded rationality) Bayesian utility maximization and, given an observation, the LLMA chooses an action that maximizes a regularized utility. Second, we utilize Bayesian social learning to construct interpretable models for LLMAs that interact sequentially with each other and the environment while performing Bayesian inference. Our proposed models capture the herding behavior exhibited by interacting LLMAs. Third, we propose a stochastic control framework to delay herding and improve state estimation accuracy under 2 settings: (a) centrally controlled LLMAs (b) autonomous LLMAs with incentives. We demonstrate the effectiveness of our methods on real datasets for hate speech classification and product quality assessment, using open-source models like LLaMA and closed-source models like ChatGPT. The main takeaway of this paper, based on empirical analysis and mathematical formalism, is that LLMAs act as rationally bounded Bayesian agents that exhibit social learning when interacting.

摘要

本文探讨了运用统计信号处理与微观经济学方法实现大型语言模型智能体（LLMAs）交互的理论与算法。尽管这两个领域已发展成熟，但其在LLMAs交互决策中的应用尚未得到探索。受在线平台贝叶斯情感分析的启发，我们构建了可解释的模型与算法，使LLMAs能够进行交互并执行贝叶斯推断。由于交互的LLMAs会从先验决策和外部输入中学习，它们可能表现出偏见和从众行为。因此，开发可解释的模型和随机控制算法对于理解与缓解这些行为至关重要。本文取得三个主要成果：首先，我们运用微观经济学中的贝叶斯显示偏好理论证明，单个LLMA满足理性疏忽（有限理性）贝叶斯效用最大化的充要条件，且在给定观测值时，LLMA会选择能最大化正则化效用的行动；其次，我们利用贝叶斯社会学习理论，为连续交互并执行贝叶斯推断的LLMAs构建可解释模型，该模型能捕捉交互LLMAs表现出的从众行为；第三，我们提出随机控制框架以延缓从众现象，并在两种场景下提升状态估计精度：（a）中心化控制的LLMAs（b）具有激励机制的自主LLMAs。通过LLaMA等开源模型和ChatGPT等闭源模型，我们在仇恨言论分类和产品质量评估的真实数据集上验证了方法的有效性。基于实证分析与数学形式化，本文的核心结论是：LLMAs在交互时表现为具有社会学习特性的理性有限贝叶斯智能体。

FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers

Abstract

arXiv:2411.14507v2 Announce Type: replace-cross Abstract: Generative Pre-trained Transformers (GPTs) have demonstrated remarkable performance across diverse domains, largely due to the extensive scaling of model parameters. Recent works have observed redundancy within transformer blocks and developed compression methods by structured pruning of less important blocks. However, such direct removal often leads to irreversible performance degradation. In this paper, we propose FuseGPT, a novel methodology designed to recycle pruned transformer blocks, thereby recovering the model's performance. Firstly, we introduce a new importance detection metric, Macro Influence (MI), which evaluates the long-term impact of each transformer block by quantifying the information loss incurred upon its removal. Next, we propose group-level layer fusion, which leverages the parameters from layers of less important blocks and integrates them into the corresponding layers of neighboring blocks. This fusion process is not a one-time operation but is refined through iterative parameter updates by lightweight group-level fine-tuning. Specifically, the injected parameters are frozen but are weighted with learnable rank decomposition matrices to reduce the computational overhead during fine-tuning. Our approach not only works well for large language models but also for large multimodal models. Experimental results indicate that, even with modest amounts of data, FuseGPT surpasses previous methods in both perplexity and zero-shot task performance.

摘要

生成式预训练变换器（GPT）凭借模型参数的大规模扩展，在多个领域展现出卓越性能。近期研究发现变换器块中存在冗余，并通过结构化剪枝去除次要块来开发压缩方法。然而，这种直接移除常导致不可逆的性能下降。本文提出FuseGPT，一种创新方法旨在回收被剪枝的变换器块以恢复模型性能。首先，我们提出新的重要性检测指标——宏观影响力（MI），通过量化移除块时产生的信息损失来评估各变换器块的长期影响。其次，我们提出组级层融合技术，利用次要块的层参数并将其整合至相邻块的对应层中。该融合过程并非一次性操作，而是通过轻量级组级微调进行迭代参数更新优化。具体而言，注入参数被冻结但通过可学习的秩分解矩阵加权，以降低微调时的计算开销。本方法不仅适用于大语言模型，对大规模多模态模型同样有效。实验结果表明，即使使用少量数据，FuseGPT在困惑度和零样本任务性能上均超越现有方法。

Rethinking Chain-of-Thought from the Perspective of Self-Training

Abstract

arXiv:2412.10827v4 Announce Type: replace-cross Abstract: Chain-of-thought (CoT) reasoning has emerged as an effective approach for activating latent capabilities in LLMs. Interestingly, we observe that both CoT reasoning and self-training share the core objective: iteratively leveraging model-generated information to progressively reduce prediction uncertainty. Building on this insight, we propose a novel CoT framework to improve reasoning performance. Our framework integrates two key components: (i) a task-specific prompt module that optimizes the initial reasoning process, and (ii) an adaptive reasoning iteration module that dynamically refines the reasoning process and addresses the limitations of previous CoT approaches, i.e., over-reasoning and high similarity between consecutive reasoning iterations. Extensive experiments demonstrate that the proposed method achieves significant advantages in both performance and computational efficiency.

摘要

思维链（CoT）推理已成为激活大语言模型潜在能力的有效方法。有趣的是，我们观察到CoT推理与自训练共享核心目标：通过迭代利用模型生成的信息来逐步降低预测不确定性。基于这一发现，我们提出了一种新型CoT框架以提升推理性能。该框架包含两个关键组件：（1）任务特定提示模块，用于优化初始推理过程；（2）自适应推理迭代模块，可动态优化推理流程并解决传统CoT方法的局限性（如过度推理和连续迭代间高相似性问题）。大量实验表明，所提方法在性能和计算效率方面均具有显著优势。

HARP: Hesitation-Aware Reframing in Transformer Inference Pass

Abstract

arXiv:2412.07282v2 Announce Type: replace-cross Abstract: This paper aims to improve the performance of large language models by addressing the variable computational demands in inference steps, where some tokens require more computational resources than others. We present HARP, a simple modification to the "off-the-shelf" Transformer forward pass. Drawing from hesitation and the framing effect in decision-making, HARP selectively applies additional computation when the model encounters uncertainty during token generation. Our method mimics human cognitive processes by pausing at difficult decision points and reframing inputs for a different perspective. Unlike other approaches, HARP is model-agnostic, training-free, and easy to implement. We evaluate our method across various downstream tasks and model sizes, demonstrating performance improvements of up to +5.16%. Notably, HARP achieves these gains while keeping inference twice as fast as beam search. Simple yet effective, HARP provides insights into the potential of adaptive computation for enhancing the performance of Transformer-based language models.

摘要

本文旨在通过解决推理步骤中计算需求可变的问题来提升大型语言模型的性能，其中某些标记需要比其他标记更多的计算资源。我们提出HARP方法，这是一种对"现成"Transformer前向传播过程的简单修改。借鉴决策过程中的犹豫和框架效应，HARP在模型生成标记遇到不确定性时选择性地施加额外计算。我们的方法通过在困难决策点暂停并重构输入以获得不同视角，模拟了人类认知过程。与其他方法不同，HARP具有模型无关性、无需训练且易于实现的特性。我们在多种下游任务和模型规模上评估了该方法，结果显示性能提升最高达+5.16%。值得注意的是，HARP在保持推理速度为束搜索两倍的同时实现了这些提升。该方法简单却收效显著，为基于Transformer的语言模型性能提升提供了自适应计算潜力的新见解。

AIGCodeSet: A New Annotated Dataset for AI Generated Code Detection

Abstract

arXiv:2412.16594v3 Announce Type: replace-cross Abstract: While large language models provide significant convenience for software development, they can lead to ethical issues in job interviews and student assignments. Therefore, determining whether a piece of code is written by a human or generated by an artificial intelligence (AI) model is a critical issue. In this study, we present AIGCodeSet, which consists of 2,828 AI-generated and 4,755 human-written Python code samples, with the AI-generated samples created using CodeLlama 34B, Codestral 22B, and Gemini 1.5 Flash. In addition, we share the results of our experiments conducted with baseline detection methods. Our experiments show that a Bayesian classifier outperforms the other models.

摘要

虽然大语言模型为软件开发提供了极大便利，但它们可能在求职面试和学生作业中引发伦理问题。因此，判定代码是由人类编写还是由人工智能（AI）模型生成成为一个关键议题。本研究提出了AIGCodeSet数据集，包含使用CodeLlama 34B、Codestral 22B和Gemini 1.5 Flash生成的2,828份AI代码及4,755份人类编写的Python代码。此外，我们还分享了采用基线检测方法开展的实验结果。实验表明，贝叶斯分类器的检测性能优于其他模型。

Crabs: Consuming Resource via Auto-generation for LLM-DoS Attack under Black-box Settings

Abstract

arXiv:2412.13879v4 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks yet remain vulnerable to external threats, particularly LLM Denial-of-Service (LLM-DoS) attacks. Specifically, LLM-DoS attacks aim to exhaust computational resources and block services. However, existing studies predominantly focus on white-box attacks, leaving black-box scenarios underexplored. In this paper, we introduce the Auto-Generation for LLM-DoS (AutoDoS) attack, an automated algorithm designed for black-box LLMs. AutoDoS constructs the DoS Attack Tree and expands the node coverage to achieve effectiveness under black-box conditions. Through transferability-driven iterative optimization, AutoDoS can work across different models with a single prompt. Furthermore, we reveal that embedding the Length Trojan allows AutoDoS to bypass existing defenses more effectively. Experimental results show that AutoDoS amplifies service response latency by more than 250×, leading to severe resource consumption in terms of GPU utilization and memory usage. Our work provides a new perspective on LLM-DoS attacks and security defenses. Our code is available at https://github.com/shuita2333/AutoDoS.

摘要

大型语言模型（LLMs）已在多样化任务中展现出卓越性能，但仍易受外部威胁影响，尤其是LLM拒绝服务（LLM-DoS）攻击。此类攻击旨在耗尽计算资源并阻断服务，然而现有研究主要集中于白盒攻击，对黑盒场景的探索不足。本文提出面向黑盒LLMs的自动化算法AutoDoS攻击，通过构建DoS攻击树并扩展节点覆盖范围，实现在黑盒条件下的高效攻击。基于可迁移性的迭代优化使AutoDoS能在单一提示中跨模型生效。此外，我们发现嵌入长度木马可使AutoDoS更有效绕过现有防御机制。实验表明，AutoDoS能将服务响应延迟显著提升250倍以上，导致GPU利用率和内存占用等资源消耗急剧增加。本研究为LLM-DoS攻击与安全防御提供了新视角。代码已开源：https://github.com/shuita2333/AutoDoS。

EscapeBench: Towards Advancing Creative Intelligence of Language Model Agents

Abstract

arXiv:2412.13549v2 Announce Type: replace-cross Abstract: Language model agents excel in long-session planning and reasoning, but existing benchmarks primarily focus on goal-oriented tasks with explicit objectives, neglecting creative adaptation in unfamiliar environments. To address this, we introduce EscapeBench, a benchmark suite of room escape game environments designed to challenge agents with creative reasoning, unconventional tool use, and iterative problem-solving to uncover implicit goals. Our results show that current LM models, despite employing working memory and Chain-of-Thought reasoning, achieve only 15% average progress without hints, highlighting their limitations in creativity. To bridge this gap, we propose EscapeAgent, a framework designed to enhance creative reasoning through Foresight (innovative tool use) and Reflection (identifying unsolved tasks). Experiments show that EscapeAgent can execute action chains over 1,000 steps while maintaining logical coherence. It navigates and completes games with up to 40% fewer steps and hints, performs robustly across difficulty levels, and achieves higher action success rates with more efficient and innovative puzzle-solving strategies.

摘要

语言模型代理在长会话规划与推理方面表现卓越，但现有基准主要关注目标明确的任务，忽视了陌生环境中的创造性适应能力。为此，我们提出EscapeBench——一套密室逃脱游戏环境基准测试，旨在通过创造性推理、非常规工具使用和迭代式问题解决来发现隐含目标，从而挑战代理能力。实验表明，当前语言模型即便采用工作记忆和思维链推理技术，在无提示情况下平均进度仅为15%，暴露出其创造力局限。为弥补这一差距，我们设计了EscapeAgent框架，通过"前瞻"（创新性工具使用）和"反思"（识别未解决任务）来增强创造性推理。实验证明，EscapeAgent能执行超过1000步的动作链并保持逻辑连贯性，其通关步骤和提示需求减少达40%，在不同难度级别均表现稳健，并以更高效创新的解谜策略实现更高动作成功率。

Each Graph is a New Language: Graph Learning with LLMs

Abstract

arXiv:2501.11478v3 Announce Type: replace-cross Abstract: Recent efforts leverage Large Language Models (LLMs) for modeling text-attributed graph structures in node classification tasks. These approaches describe graph structures for LLMs to understand or aggregate LLM-generated textual attribute embeddings through graph structure. However, these approaches face two main limitations in modeling graph structures with LLMs. (i) Graph descriptions become verbose in describing high-order graph structure. (ii) Textual attributes alone do not contain adequate graph structure information. It is challenging to model graph structure concisely and adequately with LLMs. LLMs lack built-in mechanisms to model graph structures directly. They also struggle with complex long-range dependencies between high-order nodes and target nodes. Inspired by the observation that LLMs pre-trained on one language can achieve exceptional performance on another with minimal additional training, we propose Graph-Defined Language for Large Language Model (GDL4LLM). This novel framework enables LLMs to transfer their powerful language understanding capabilities to graph-structured data. GDL4LLM translates graphs into a graph language corpus instead of graph descriptions and pre-trains LLMs on this corpus to adequately understand graph structures. During fine-tuning, this corpus describes the structural information of target nodes concisely with only a few tokens. By treating graphs as a new language, GDL4LLM enables LLMs to model graph structures adequately and concisely for node classification tasks. Extensive experiments on three real-world datasets demonstrate that GDL4LLM outperforms description-based and textual attribute embeddings-based baselines by efficiently modeling different orders of graph structure with LLMs.

GMoE: Empowering LLMs Fine-Tuning via MoE Graph Collaboration

Abstract

arXiv:2412.16216v2 Announce Type: replace-cross Abstract: The sparse Mixture-of-Experts (MoE) architecture of large language models (LLMs) confronts an inherent issue of load imbalance arising from the simplistic linear router strategy, which ultimately causes the instability and inefficient learning of LLMs. To address this challenge, we introduce a novel MoE graph-based framework, GMoE, aimed at enhancing the collaboration among multiple experts. In GMoE, a graph router function is designed to capture the collaboration signals among experts. This enables all experts to dynamically allocate information derived from input data by sharing information with their neighboring experts. Moreover, we put forward two coordination strategies in GMoE: the Poisson distribution-based distinction strategy and the Normal distribution-based balance strategy, to further release the capacity of each expert and increase the model stability in the fine-tuning of LLMs. Specifically, we leverage a parameter-efficient fine-tuning technique, i.e., Low-Rank Adaptation (LoRA), to implement the graph MoE architecture. Extensive experiments on four real-world benchmark datasets demonstrate the effectiveness of GMoE, showing the benefits of facilitating collaborations of multiple experts in LLM fine-tuning. The code of experimental implementation is available at https://github.com/BAI-LAB/GMoE

摘要

大语言模型（LLMs）的稀疏混合专家（MoE）架构面临一个由简单线性路由策略引起的固有负载不均衡问题，这最终导致LLMs的不稳定和低效学习。为应对这一挑战，我们提出了一种基于图的新型MoE框架GMoE，旨在增强多个专家之间的协作。在GMoE中，设计了一个图路由函数来捕捉专家间的协作信号，使所有专家能够通过与相邻专家共享信息，动态分配来自输入数据的信息。此外，我们在GMoE中提出了两种协调策略：基于泊松分布的区分策略和基于正态分布的平衡策略，以进一步释放每个专家的能力，并提高LLM微调中的模型稳定性。具体而言，我们采用了一种参数高效的微调技术，即低秩自适应（LoRA），来实现图MoE架构。在四个真实世界基准数据集上的大量实验证明了GMoE的有效性，展示了在LLM微调中促进多专家协作的优势。实验实现代码可在https://github.com/BAI-LAB/GMoE获取。

iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use

Abstract

arXiv:2501.09766v4 Announce Type: replace-cross Abstract: Augmenting large language models (LLMs) with external tools is a promising approach to enhance their capabilities, especially for complex tasks. Synthesizing tool-use data through real-world simulations is an effective way to achieve this. However, our investigation reveals that training gains decay significantly as synthetic data increases. The model struggles to benefit from more synthetic data, which cannot equip it with advanced tool-use capabilities in complex scenarios. Moreover, we discovered that this limitation usually manifests as a fragment deficiency (i.e., parameter errors) in the response. To this end, we propose an iterative reinforced fine-tuning strategy designed to alleviate this limitation. This strategy involves (1) enhancing the diversity of responses for synthetic data through path exploration with Monte Carlo Tree Search, and (2) iteratively pinpointing the model's deficiencies by constructing fine-grained preference pairs and then correcting them with preference optimization algorithms. The experiments show that our method achieves 13.11% better performance than the same-size base model. It achieves an improvement of 6.5% in complex scenarios compared to the baseline, and it also outperforms larger open-source and closed-source models.

摘要

通过外部工具增强大语言模型（LLM）的能力是提升其处理复杂任务的有效途径，而通过真实场景模拟合成工具使用数据是实现这一目标的重要手段。然而，我们的研究发现，随着合成数据量的增加，训练收益会显著衰减。模型难以从更多合成数据中获益，且无法在复杂场景下获得先进的工具使用能力。进一步分析表明，上述局限通常表现为响应中的片段缺失（即参数错误）。为此，我们提出了一种迭代强化微调策略：首先通过蒙特卡洛树搜索的路径探索增强合成数据响应的多样性；其次通过构建细粒度偏好对迭代定位模型缺陷，并采用偏好优化算法进行针对性改进。实验表明，本方法相较同规模基线模型性能提升13.11%，在复杂场景下较基线提升6.5%，且优于更大规模的开源与闭源模型。

How to Synthesize Text Data without Model Collapse?

Abstract

arXiv:2412.14689v2 Announce Type: replace-cross Abstract: Model collapse in synthetic data indicates that iterative training on self-generated data leads to a gradual decline in performance. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem. Future GPT-{n} models will inevitably be trained on a blend of synthetic and human-produced data. In this paper, we focus on two questions: what is the impact of synthetic data on language model training, and how can data be synthesized without model collapse? We first pre-train language models across different proportions of synthetic data, revealing a negative correlation between the proportion of synthetic data and model performance. We further conduct statistical analysis on synthetic data to uncover a distributional shift phenomenon and over-concentration of n-gram features. Inspired by the above findings, we propose token editing on human-produced data to obtain semi-synthetic data. As a proof of concept, we theoretically demonstrate that token-level editing can prevent model collapse, as the test error is constrained by a finite upper bound. We conduct extensive experiments on pre-training from scratch, continual pre-training, and supervised fine-tuning. The results validate our theoretical proof that token-level editing improves model performance.

摘要

合成数据中的模型崩溃现象表明，对自生成数据进行迭代训练会导致性能逐渐下降。随着人工智能模型的激增，合成数据将从根本上重塑网络数据生态系统。未来的GPT- $\{n\}$ 模型将不可避免地接受合成数据与人类生成数据的混合训练。本文聚焦两个核心问题：合成数据对语言模型训练有何影响？如何合成数据才能避免模型崩溃？我们首先在不同比例的合成数据上预训练语言模型，发现合成数据比例与模型性能呈负相关。进一步通过统计分析揭示合成数据存在分布偏移现象及n元语法特征过度集中的问题。基于上述发现，我们提出对人类生成数据进行词元编辑以获取半合成数据。作为概念验证，我们理论证明词元级编辑能通过约束测试误差有限上界来防止模型崩溃。在从头预训练、持续预训练和监督微调三个场景下的实验均验证了理论结论：词元级编辑能有效提升模型性能。

A partition cover approach to tokenization

Abstract

arXiv:2501.06246v2 Announce Type: replace-cross Abstract: Tokenization is the process of encoding strings into tokens of a fixed vocabulary size, and is widely utilized in Natural Language Processing applications. The leading tokenization algorithm today is Byte-Pair Encoding (BPE), which formulates the tokenization problem as a compression problem and tackles it by performing sequences of merges. In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm GreedTok. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem which has a simple $(1 - 1/e)$ -approximation algorithm GreedWMC. Through empirical evaluations on real-world corpora, we show that GreedTok outperforms BPE and Unigram on compression and achieves a covering score comparable to GreedWMC. Finally, our extensive pre-training for two transformer-based language models with 1 billion parameters, comparing the choices of BPE and GreedTok as the tokenizer, shows that GreedTok achieves a lower bit per byte even when we control for either the total dataset proportion or total training tokens.

摘要

分词是将字符串编码为固定词汇表大小的标记的过程，在自然语言处理应用中广泛使用。当前主流的分词算法是字节对编码（BPE），其将分词问题表述为压缩问题，并通过执行一系列合并操作来解决。在本研究中，我们将分词表述为一个优化目标，通过从顶点覆盖问题的简单归约证明其NP难性，并提出一种多项式时间的贪心算法GreedTok。我们的表述自然地松弛为已被深入研究的加权最大覆盖问题，该问题具有简单的 $(1 - 1/e)$ 近似算法GreedWMC。通过对真实语料库的实证评估，我们发现GreedTok在压缩性能上优于BPE和Unigram算法，并实现了与GreedWMC相当的覆盖分数。最后，我们针对两个具有10亿参数的基于Transformer的语言模型进行了广泛的预训练实验，比较BPE和GreedTok作为分词器的选择，结果表明即使控制总数据集比例或总训练标记数量，GreedTok仍能实现更低的每字节比特率。
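
The weighted maximum coverage relaxation mentioned in the abstract can be illustrated with the textbook greedy heuristic, which repeatedly picks the set with the largest marginal covered weight and is known to give a $(1 - 1/e)$ approximation. The sketch below is only that generic heuristic, not the paper's GreedTok or its exact GreedWMC implementation; the function name, the toy candidate "tokens", and the weights are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Set


def greedy_weighted_max_coverage(
    candidates: Dict[str, FrozenSet[int]],
    weights: Dict[int, float],
    k: int,
) -> List[str]:
    """Pick up to k sets, each time taking the largest marginal covered weight."""
    covered: Set[int] = set()
    chosen: List[str] = []
    for _ in range(k):
        best_name: Optional[str] = None
        best_gain = 0.0
        for name in sorted(candidates):  # sorted for deterministic tie-breaking
            if name in chosen:
                continue
            # Marginal gain counts only elements not covered yet.
            gain = sum(weights[e] for e in candidates[name] - covered)
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:  # no remaining set adds any weight
            break
        chosen.append(best_name)
        covered |= candidates[best_name]
    return chosen


# Hypothetical toy instance: elements stand for corpus positions,
# sets stand for candidate tokens that would cover them.
toy_candidates = {
    "ab": frozenset({0, 1, 2}),
    "bc": frozenset({2, 3}),
    "cd": frozenset({3, 4, 5}),
}
toy_weights = {0: 1.0, 1: 1.0, 2: 1.0, 3: 5.0, 4: 1.0, 5: 1.0}
picked = greedy_weighted_max_coverage(toy_candidates, toy_weights, k=2)
```

On this toy instance the first pick is "cd" (gain 7), after which "ab" (gain 3) beats "bc" (remaining gain 1), so the marginal-gain rule matters: "bc" looked better than "ab" before anything was covered.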

The Invisible Hand: Unveiling Provider Bias in Large Language Models for Code Generation

Abstract

arXiv:2501.07849v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have emerged as the new recommendation engines, surpassing traditional methods in both capability and scope, particularly in code generation. In this paper, we reveal a novel provider bias in LLMs: without explicit directives, these models show systematic preferences for services from specific providers in their recommendations (e.g., favoring Google Cloud over Microsoft Azure). To systematically investigate this bias, we develop an automated pipeline to construct the dataset, incorporating 6 distinct coding task categories and 30 real-world application scenarios. Leveraging this dataset, we conduct the first comprehensive empirical study of provider bias in LLM code generation across seven state-of-the-art LLMs, utilizing approximately 500 million tokens (equivalent to $5,000+ in computational costs). Our findings reveal that LLMs exhibit significant provider preferences, predominantly favoring services from Google and Amazon, and can autonomously modify input code to incorporate their preferred providers without users' requests. Such a bias holds far-reaching implications for market dynamics and societal equilibrium, potentially contributing to digital monopolies. It may also deceive users and violate their expectations, leading to various consequences. We call on the academic community to recognize this emerging issue and develop effective evaluation and mitigation methods to uphold AI security and fairness.

摘要

大语言模型（LLMs）已成为新型推荐引擎，在代码生成等领域的能力和范围已超越传统方法。本文揭示LLMs中存在一种新型供应商偏见：在没有明确指令时，这些模型在推荐中会系统性地偏好特定供应商的服务（例如更倾向于推荐谷歌云而非微软Azure）。为系统研究该偏见，我们开发了自动化流程构建数据集，涵盖6个不同编码任务类别和30个真实应用场景。基于该数据集，我们首次对7个最先进LLM的代码生成供应商偏见进行了全面实证研究，消耗约5亿token（相当于5000美元以上的计算成本）。研究发现：LLMs表现出显著的供应商偏好，主要倾向于谷歌和亚马逊的服务，并能自主修改输入代码以加入其偏好的供应商（无需用户请求）。这种偏见对市场格局和社会均衡具有深远影响，可能助长数字垄断；同时可能欺骗用户并违背其预期，引发多重后果。我们呼吁学术界重视这一新兴问题，开发有效的评估与缓解方法以维护AI安全与公平。

NExtLong: Toward Effective Long-Context Training without Long Documents

Abstract

arXiv:2501.12766v2 Announce Type: replace-cross Abstract: Large language models (LLMs) with extended context windows have made significant strides yet remain a challenge due to the scarcity of long documents. Existing methods tend to synthesize long-context data but lack a clear mechanism to reinforce the long-range dependency modeling. To address this limitation, we propose NExtLong, a novel framework for synthesizing long-context data through Negative document Extension. NExtLong decomposes a document into multiple meta-chunks and extends the context by interleaving hard negative distractors retrieved from pretraining corpora. This approach compels the model to discriminate long-range dependent context from distracting content, enhancing its ability to model long-range dependencies. Extensive experiments demonstrate that NExtLong achieves significant performance improvements on the HELMET and RULER benchmarks compared to existing long-context synthesis approaches and leading models, which are trained on non-synthetic long documents. These findings highlight NExtLong's ability to reduce reliance on non-synthetic long documents, making it an effective framework for developing advanced long-context LLMs.

摘要

具有扩展上下文窗口的大型语言模型（LLMs）虽已取得显著进展，但由于长文档的稀缺性，仍面临挑战。现有方法倾向于合成长上下文数据，但缺乏明确的机制来增强长程依赖建模。为应对这一局限，我们提出NExtLong框架，通过负文档扩展合成长上下文数据。该框架将文档分解为多个元块，并通过交错插入从预训练语料库中检索的困难负干扰项来扩展上下文。这种方法迫使模型从干扰内容中区分长程依赖上下文，从而增强其建模长程依赖的能力。大量实验表明，相较于现有的长上下文合成方法和基于非合成长文档训练的领先模型，NExtLong在HELMET和RULER基准测试上均实现了显著性能提升。这些发现凸显了NExtLong降低对非合成长文档依赖的能力，使其成为开发先进长上下文LLMs的有效框架。

Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation

Abstract

arXiv:2501.12432v2 Announce Type: replace-cross Abstract: Although current Large Language Models (LLMs) exhibit impressive capabilities, performing complex real-world tasks still requires tool learning. Mainstream methods, such as CoT/ReAct, rely on step-by-step tool invocation to interact with external environments, but they are limited in perceptual scope and lack adequate task-planning capability. To address these limitations, other studies introduce the Depth-First Search-based Decision Tree (DFSDT), which still suffers from high computational cost. In this paper, we introduce a novel parallel tool invocation paradigm, DTA-Llama (Divide-Then-Aggregate Llama). First, we transform traditional tree-based tool search paths into a Directed Acyclic Graph (DAG) structure, generating a high-quality parallel tool invocation dataset. DTA-Llama is then trained on the dataset to learn to iteratively divide the current task into several parallel tool invocation sub-tasks and aggregate the invocation results to decide the next actions. Furthermore, we introduce an efficient inference framework inspired by the Process/Threads mechanism when applying DTA-Llama to practical tasks. Experimental results show that our approach substantially enhances task performance while reducing token consumption and inference time. Llama2-7B, using our method, is comparable to the official parallel function calling method of GPT-3.5. The relevant code, dataset, and model weights are available at https://corn0205.github.io/

摘要

尽管当前的大型语言模型（LLMs）展现出令人印象深刻的能力，但执行复杂的现实任务仍需工具学习。主流方法（如CoT/ReAct）依赖逐步工具调用来与外部环境交互，但其感知范围有限且缺乏充分的任务规划能力。为解决这些局限性，已有研究引入首个基于搜索的决策树（DFSDT），但仍存在计算成本过高的问题。本文提出一种新颖的并行工具调用范式DTA-Llama（分治聚合型Llama）。首先，我们将传统的树状工具搜索路径转化为有向无环图（DAG）结构，生成高质量的并行工具调用数据集。随后基于该数据集训练DTA-Llama，使其学会迭代地将当前任务分解为多个并行工具调用子任务，并通过聚合调用结果决定后续动作。此外，在将DTA-Llama应用于实际任务时，我们受进程/线程机制启发设计了高效推理框架。实验结果表明，我们的方法在显著提升任务性能的同时，有效降低了token消耗和推理时间。采用本方法的Llama2-7B模型性能可比肩GPT-3.5官方并行函数调用方法。相关代码、数据集及模型权重已发布于https://corn0205.github.io/。

Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models

Abstract

arXiv:2501.13772v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) demonstrate impressive zero-shot performance across a wide range of natural language processing tasks. Integrating various modality encoders further expands their capabilities, giving rise to Multimodal Large Language Models (MLLMs) that process not only text but also visual and auditory modality inputs. However, these advanced capabilities may also pose significant security risks, as models can be exploited to generate harmful or inappropriate content through jailbreak attacks. While prior work has extensively explored how manipulating textual or visual modality inputs can circumvent safeguards in LLMs and MLLMs, the vulnerability of audio-specific Jailbreak on Large Audio-Language Models (LALMs) remains largely underexplored. To address this gap, we introduce Jailbreak-AudioBench, which consists of the Toolbox, curated Dataset, and comprehensive Benchmark. The Toolbox supports not only text-to-audio conversion but also a range of audio editing techniques. The curated Dataset provides diverse explicit and implicit jailbreak audio examples in both original and edited forms. Utilizing this dataset, we evaluate multiple state-of-the-art LALMs, establishing the most comprehensive audio jailbreak benchmark to date. Finally, Jailbreak-AudioBench establishes a foundation for advancing future research on LALMs safety alignment by enabling the in-depth exposure of more powerful jailbreak threats, such as query-based audio editing, and by facilitating the development of effective defense mechanisms.

摘要

大型语言模型（LLMs）在众多自然语言处理任务中展现出卓越的零样本性能。通过整合多模态编码器，其能力进一步扩展，催生了能够处理文本、视觉及听觉输入的多模态大型语言模型（MLLMs）。然而，这些先进能力也可能带来重大安全风险，例如模型可能被越狱攻击利用以生成有害或不恰当内容。尽管先前研究已深入探讨如何通过操纵文本或视觉模态输入来绕过LLMs和MLLMs的安全防护，但针对大型音频-语言模型（LALMs）的音频特异性越狱攻击脆弱性仍鲜有探索。为此，我们提出Jailbreak-AudioBench，包含工具箱、精选数据集和综合基准测试。该工具箱不仅支持文本到音频转换，还涵盖多种音频编辑技术；精选数据集则提供原始与编辑后的多样化显性与隐性越狱音频样本。基于此数据集，我们对多个前沿LALMs进行评估，构建了迄今最全面的音频越狱基准。最终，Jailbreak-AudioBench通过深度揭示更强大的越狱威胁（如基于查询的音频编辑）以及促进有效防御机制的开发，为推进LALMs安全对齐的未来研究奠定基础。

A statistically consistent measure of semantic uncertainty using Language Models

Abstract

arXiv:2502.00507v3 Announce Type: replace-cross Abstract: To address the challenge of quantifying uncertainty in the outputs generated by language models, we propose a novel measure of semantic uncertainty, semantic spectral entropy, that is statistically consistent under mild assumptions. This measure is implemented through a straightforward algorithm that relies solely on standard, pretrained language models, without requiring access to the internal generation process. Our approach imposes minimal constraints on the choice of language models, making it broadly applicable across different architectures and settings. Through comprehensive simulation studies, we demonstrate that the proposed method yields an accurate and robust estimate of semantic uncertainty, even in the presence of the inherent randomness characteristic of generative language model outputs.

摘要

为解决语言模型输出结果不确定性量化这一挑战，我们提出了一种新型语义不确定性度量方法——语义谱熵，该方法在温和假设下具有统计一致性。该度量通过一种仅需依赖标准预训练语言模型的简易算法实现，无需访问内部生成过程。我们的方法对语言模型选择施加极简约束，使其能够广泛适用于不同架构和场景。通过全面的模拟研究，我们证明即使面对生成式语言模型输出固有的随机性特征，所提方法仍能获得准确且鲁棒的语义不确定性估计。

A Checks-and-Balances Framework for Context-Aware Ethical AI Alignment

Abstract

arXiv:2502.00136v2 Announce Type: replace-cross Abstract: This paper introduces a checks-and-balances framework for ethical alignment of Large Language Models (LLMs), inspired by three-branch governmental systems. It implements three independent yet interacting components: LLMs as the executive branch for knowledge generation, DIKE as the legislative branch establishing ethical guardrails, and ERIS as the judicial branch for contextual interpretation. Beyond structural separation, we address a fundamental challenge: regulating emotion to shape behaviors. Drawing from psychological theories where managing emotional responses prevents harmful behaviors, we develop a self-supervised learning pipeline that maps emotions to linguistic behaviors, enabling precise behavioral modulation through emotional conditioning. By integrating this approach with adversarial testing, our framework demonstrates how DIKE and ERIS direct linguistic behaviors toward ethical outcomes while preserving independence throughout knowledge generation, ethical oversight, and contextual interpretation.

摘要

本文受三权分立政府体系启发，提出一个用于大语言模型伦理对齐的制衡框架。该框架包含三个独立且交互的组件：作为行政分支负责知识生成的大语言模型、作为立法分支建立伦理护栏的DIKE系统，以及作为司法分支进行情境解释的ERIS系统。除结构分离外，我们解决了一个核心挑战：通过情绪调控塑造行为。基于心理学理论中通过管理情绪反应来预防危害行为的机制，我们开发了一个自监督学习流程，将情绪映射到语言行为，从而通过情绪调节实现精确的行为调控。通过将该方法与对抗测试相结合，我们的框架展示了DIKE和ERIS如何在保持知识生成、伦理监督和情境解释全过程独立性的同时，引导语言行为达成伦理目标。

Sigmoid Self-Attention has Lower Sample Complexity than Softmax Self-Attention: A Mixture-of-Experts Perspective

Abstract

arXiv:2502.00281v2 Announce Type: replace-cross Abstract: At the core of the popular Transformer architecture is the self-attention mechanism, which dynamically assigns softmax weights to each input token so that the model can focus on the most salient information. However, the softmax structure slows down the attention computation due to its row-wise nature, and it inherently introduces competition among tokens: as the weight assigned to one token increases, the weights of others decrease. This competitive dynamic may narrow the focus of self-attention to a limited set of features, potentially overlooking other informative characteristics. Recent experimental studies have shown that using the element-wise sigmoid function helps eliminate token competition and reduce the computational overhead. Despite these promising empirical results, a rigorous comparison between sigmoid and softmax self-attention mechanisms remains absent in the literature. This paper closes this gap by theoretically demonstrating that sigmoid self-attention is more sample-efficient than its softmax counterpart. Toward that goal, we represent the self-attention matrix as a mixture of experts and show that ``experts'' in sigmoid self-attention require significantly less data to achieve the same approximation error as those in softmax self-attention.

摘要

在流行的Transformer架构中，自注意力机制是其核心组件，该机制通过动态分配softmax权重给每个输入标记，使模型能够聚焦于最显著的信息。然而，softmax结构由于其逐行计算特性会减缓注意力计算速度，并且本质上引入了标记间的竞争：当分配给某个标记的权重增加时，其他标记的权重会相应减少。这种竞争动态可能导致自注意力的关注范围局限于少量特征，从而忽略其他具有信息量的特征。近期实验研究表明，采用逐元素的sigmoid函数有助于消除标记间竞争并降低计算开销。尽管这些实证结果颇具前景，但现有文献仍缺乏对sigmoid与softmax自注意力机制之间严谨的理论比较。本文通过理论证明填补了这一空白，表明sigmoid自注意力比softmax具有更高的样本效率。为实现这一目标，我们将自注意力矩阵表示为专家混合模型，并证明sigmoid自注意力中的'专家'要达到与softmax自注意力相同的近似误差，所需数据量显著更少。
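
The token-competition argument in the abstract can be checked numerically on a single row of attention scores. The sketch below is a minimal illustration with made-up scores, not the paper's mixture-of-experts analysis: raising one score lowers every other softmax weight, while element-wise sigmoid weights of the other tokens stay unchanged.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Row-wise softmax: weights are coupled because they must sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(scores: List[float]) -> List[float]:
    """Element-wise sigmoid: each weight depends only on its own score."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


# Hypothetical attention scores for three tokens; only the first score changes.
base = [1.0, 0.5, -0.5]
boosted = [3.0, 0.5, -0.5]

softmax_shift = softmax(boosted)[1] - softmax(base)[1]  # negative: token 2 loses weight
sigmoid_shift = sigmoid(boosted)[1] - sigmoid(base)[1]  # zero: token 2 is unaffected
```

The same independence is what removes the row-wise normalization step, which is the source of the computational overhead the abstract attributes to softmax.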

Improving Rule-based Reasoning in LLMs via Neurosymbolic Representations

Abstract

arXiv:2502.01657v2 Announce Type: replace-cross Abstract: Large language models (LLMs) continue to face challenges in reliably solving reasoning tasks, particularly those that require precise rule following, as often found in mathematical reasoning. This paper introduces a novel neurosymbolic method that improves LLM reasoning by encoding hidden states into neurosymbolic vectors, enabling problem-solving within a neurosymbolic vector space. The results are decoded and merged with the original hidden state, significantly boosting the model's performance on numerical reasoning tasks. By offloading computation through neurosymbolic representations, this method enhances efficiency, reliability, and interpretability. Experimental results demonstrate an average of 88.6% lower cross-entropy loss and 15.4 times more problems correctly solved on a suite of mathematical reasoning tasks compared to chain-of-thought prompting and supervised fine-tuning (LoRA), without degrading performance on other tasks. We make our code available at: https://github.com/vdhanraj/Neurosymbolic-LLM.

摘要

大型语言模型（LLMs）在可靠解决推理任务（尤其是需要严格遵循规则的数学推理任务）方面仍面临挑战。本文提出一种新颖的神经符号方法，通过将隐藏状态编码为神经符号向量，实现在神经符号向量空间内解决问题。解码后的结果与原始隐藏状态融合，显著提升了模型在数值推理任务中的表现。该方法通过神经符号表示实现计算卸载，从而提高了效率、可靠性和可解释性。实验结果表明，与思维链提示和监督微调（LoRA）相比，在一系列数学推理任务中平均降低88.6%的交叉熵损失，正确求解问题数量提升15.4倍，且不影响其他任务表现。代码已开源：https://github.com/vdhanraj/Neurosymbolic-LLM。

UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models

Abstract

arXiv:2502.00334v3 Announce Type: replace-cross Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in solving complex reasoning tasks, particularly in mathematics. However, the domain of physics reasoning presents unique challenges that have received significantly less attention. Existing benchmarks often fall short in evaluating LLMs' abilities on the breadth and depth of undergraduate-level physics, underscoring the need for a comprehensive evaluation. To fill this gap, we introduce UGPhysics, a large-scale and comprehensive benchmark specifically designed to evaluate UnderGraduate-level Physics (UGPhysics) reasoning with LLMs. UGPhysics includes 5,520 undergraduate-level physics problems in both English and Chinese, covering 13 subjects with seven different answer types and four distinct physics reasoning skills, all rigorously screened for data leakage. Additionally, we develop a Model-Assistant Rule-based Judgment (MARJ) pipeline specifically tailored for assessing answer correctness of physics problems, ensuring accurate evaluation. Our evaluation of 31 leading LLMs shows that the highest overall accuracy is only 49.8% (achieved by OpenAI-o1-mini), which emphasizes the necessity for models with stronger physics reasoning skills, beyond math abilities. We hope UGPhysics, along with MARJ, will drive future advancements in AI for physics reasoning. Codes and data are available at https://github.com/YangLabHKUST/UGPhysics.

摘要

大语言模型（LLMs）在解决复杂推理任务（尤其是数学领域）中展现出卓越能力，但物理推理领域特有的挑战尚未得到充分关注。现有基准测试常难以全面评估LLMs在本科物理广度和深度上的能力，凸显了建立系统性评估体系的必要性。为此，我们推出UGPhysics——一个专为评估本科物理（UGPhysics）推理能力而设计的大规模综合基准。该基准包含5,520道中英文本科物理题目，覆盖13个学科领域，涉及7种答案类型和4类物理推理技能，所有题目均经过严格的数据泄露筛查。我们还开发了基于模型辅助规则判断（MARJ）的评估流程，专门用于物理问题答案正确性判定，确保评估准确性。对31个主流LLMs的测试表明，最高总体准确率仅为49.8%（由OpenAI-o1-mini实现），这凸显了模型需要超越数学能力的物理推理技能。我们期待UGPhysics与MARJ能推动人工智能在物理推理领域的进步。代码与数据详见https://github.com/YangLabHKUST/UGPhysics。

JingFang: An Expert-Level Large Language Model for Traditional Chinese Medicine Clinical Consultation and Syndrome Differentiation-Based Treatment

Abstract

arXiv:2502.04345v2 Announce Type: replace-cross Abstract: The effective application of traditional Chinese medicine (TCM) requires extensive knowledge of TCM and clinical experience. The emergence of Large Language Models (LLMs) provides a solution to this, but existing LLMs for TCM exhibit critical limitations: incomplete clinical consultation and diagnoses, as well as inaccurate syndrome differentiation. To address these issues, we establish JingFang (JF), a novel TCM LLM that demonstrates expert-level ability in clinical consultation and syndrome differentiation. We propose a Multi-Agent Collaborative Chain-of-Thought Mechanism (MACCTM) for comprehensive and targeted clinical consultation, equipping JF with effective and accurate diagnostic ability. In addition, a Syndrome Agent and a Dual-Stage Recovery Scheme (DSRS) are developed to improve the accuracy of syndrome differentiation and the subsequent corresponding treatment. JingFang not only facilitates the application of LLMs but also promotes the effective application of TCM for healthcare.

Polynomial, trigonometric, and tropical activations

Abstract

arXiv:2502.01247v2 Announce Type: replace-cross Abstract: Which functions can be used as activations in deep neural networks? This article explores families of functions based on orthonormal bases, including the Hermite polynomial basis and the Fourier trigonometric basis, as well as a basis resulting from the tropicalization of a polynomial basis. Our study shows that, through simple variance-preserving initialization and without additional clamping mechanisms, these activations can successfully be used to train deep models, such as GPT-2 for next-token prediction on OpenWebText and ConvNeXt for image classification on ImageNet. Our work addresses the issue of exploding and vanishing activations and gradients, particularly prevalent with polynomial activations, and opens the door for improving the efficiency of large-scale learning tasks. Furthermore, our approach provides insight into the structure of neural networks, revealing that networks with polynomial activations can be interpreted as multivariate polynomial mappings. Finally, using Hermite interpolation, we show that our activations can closely approximate classical ones in pre-trained models by matching both the function and its derivative, making them especially useful for fine-tuning tasks. These activations are available in the torchortho library, which can be accessed via: https://github.com/K-H-Ismail/torchortho.

摘要

本文探究了基于正交基的函数族，包括Hermite多项式基、傅里叶三角基以及多项式基热带化生成的基函数。研究表明，通过简单的方差保持初始化且无需额外钳制机制，这些激活函数能成功用于训练GPT-2（OpenWebText数据集的下一个词元预测）和ConvNeXt（ImageNet图像分类）等深度模型。我们的工作解决了多项式激活函数中尤为突出的激活值与梯度爆炸/消失问题，为提升大规模学习任务效率开辟了新途径。此外，该方法揭示了神经网络的结构特性：采用多项式激活的网络可解释为多元多项式映射。最后通过Hermite插值证明，我们的激活函数能通过匹配函数值及其导数，精确逼近预训练模型中的经典激活函数，特别适用于微调任务。相关激活函数已集成至torchortho库（访问地址：https://github.com/K-H-Ismail/torchortho）。

Preference Leakage: A Contamination Problem in LLM-as-a-judge

Abstract

arXiv:2502.01534v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development. While their combination significantly enhances the efficiency of model training and evaluation, little attention has been given to the potential contamination brought by this new model development paradigm. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators. To study this issue, we first define three common relatednesses between the data generator LLM and the judge LLM: being the same model, having an inheritance relationship, and belonging to the same model family. Through extensive experiments, we empirically confirm the bias of judges towards their related student models caused by preference leakage across multiple LLM baselines and benchmarks. Further analysis suggests that preference leakage is a pervasive and real-world problem that is harder to detect compared to previously identified biases in LLM-as-a-judge scenarios. All of these findings imply that preference leakage is a widespread and challenging problem in the area of LLM-as-a-judge. We release all codes and data at: https://github.com/David-Li0406/Preference-Leakage.

摘要

大型语言模型（LLMs）作为评估工具和基于LLM的数据合成已成为模型开发中两种基础的LLM驱动数据标注方法。尽管二者的结合显著提升了模型训练与评估的效率，但这种新型模型开发范式可能带来的潜在污染问题尚未得到足够关注。本研究揭示了"偏好泄漏"现象——当合成数据生成器与基于LLM的评估器存在关联性时，会引发LLM-as-a-judge场景中的污染问题。为探究该问题，我们首先定义了数据生成LLM与评估LLM之间的三种常见关联：同一模型、继承关系以及同属一个模型家族。通过大量实验，我们在多个LLM基线和基准测试中实证验证了因偏好泄漏导致的评估者对关联学生模型的偏向性。进一步分析表明，与先前发现的LLM-as-a-judge场景中的偏差相比，偏好泄漏是一个更普遍、更现实且更难检测的问题。这些发现均表明偏好泄漏是LLM-as-a-judge领域中广泛存在且具有挑战性的问题。我们已公开所有代码与数据：https://github.com/David-Li0406/Preference-Leakage。

ACECODER: Acing Coder RL via Automated Test-Case Synthesis

Abstract

arXiv:2502.01718v4 Announce Type: replace-cross Abstract: Most progress in recent coder models has been driven by supervised fine-tuning (SFT), while the potential of reinforcement learning (RL) remains largely unexplored, primarily due to the lack of reliable reward data/model in the code domain. In this paper, we address this challenge by leveraging automated large-scale test-case synthesis to enhance code model training. Specifically, we design a pipeline that generates extensive (question, test-cases) pairs from existing code data. Using these test cases, we construct preference pairs based on pass rates over sampled programs to train reward models with Bradley-Terry loss. The reward model yields an average 10-point improvement for Llama-3.1-8B-Ins and a 5-point improvement for Qwen2.5-Coder-7B-Ins through best-of-32 sampling, making the 7B model on par with the 236B DeepSeek-V2.5. Furthermore, we conduct reinforcement learning with both reward models and test-case pass rewards, leading to consistent improvements across HumanEval, MBPP, BigCodeBench, and LiveCodeBench (V4). Notably, we follow the R1-style training to start from Qwen2.5-Coder-base directly and show that our RL training can improve the model on HumanEval-plus by over 25% and on MBPP-plus by 6% in merely 80 optimization steps. We believe our results highlight the huge potential of reinforcement learning in coder models.

摘要

近年来编码器模型的进展主要得益于监督微调（SFT），而强化学习（RL）的潜力在很大程度上尚未被充分探索，这主要是由于代码领域缺乏可靠的奖励数据/模型。本文通过利用自动化大规模测试用例合成来增强代码模型训练，从而解决这一挑战。具体而言，我们设计了一个流程，能够从现有代码数据中生成大量（问题，测试用例）对。利用这些测试用例，我们基于采样程序通过率构建偏好对，并通过Bradley-Terry损失训练奖励模型。实验表明，在32选1采样中，Llama-3.1-8B-Ins平均提升了10个百分点，Qwen2.5-Coder-7B-Ins提升了5个百分点，使得7B模型性能与236B的DeepSeek-V2.5相当。此外，我们结合奖励模型和测试用例通过奖励进行强化学习，在HumanEval、MBPP、BigCodeBench和LiveCodeBench（V4）等基准上实现了持续改进。值得注意的是，我们采用R1风格训练方法直接从Qwen2.5-Coder-base开始训练，结果显示仅经过80次优化步骤，模型在HumanEval-plus上的性能提升超过25%，在MBPP-plus上提升6%。我们相信这些结果凸显了强化学习在编码器模型中的巨大潜力。
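
The reward-model recipe in the abstract (preference pairs from pass rates, trained with a Bradley-Terry loss) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `min_gap` threshold for accepting a pair and both function names are hypothetical, and the loss is the standard Bradley-Terry form $-\log \sigma(r_{\text{chosen}} - r_{\text{rejected}})$ applied to scalar rewards.

```python
import math
from itertools import combinations
from typing import List, Tuple


def build_preference_pairs(
    pass_rates: List[float], min_gap: float = 0.4
) -> List[Tuple[int, int]]:
    """Return (chosen_index, rejected_index) for sampled programs whose
    test-case pass rates differ by at least min_gap (an assumed threshold)."""
    pairs: List[Tuple[int, int]] = []
    for i, j in combinations(range(len(pass_rates)), 2):
        if abs(pass_rates[i] - pass_rates[j]) >= min_gap:
            pairs.append((i, j) if pass_rates[i] > pass_rates[j] else (j, i))
    return pairs


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(reward_chosen - reward_rejected)), computed stably."""
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    # For negative diff, log(1 + e^{-diff}) = -diff + log(1 + e^{diff}).
    return -diff + math.log1p(math.exp(diff))


# Hypothetical pass rates of three sampled programs for one question.
example_pairs = build_preference_pairs([1.0, 0.5, 0.25])
```

The loss is small when the reward model already scores the higher-pass-rate program above the other one, and grows roughly linearly when the ordering is reversed, which is what pushes the learned reward to agree with the test cases.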

Mol-LLM: Multimodal Generalist Molecular LLM with Improved Graph Utilization

Abstract

arXiv:2502.02810v2 Announce Type: replace-cross Abstract: Recent advances in large language models (LLMs) have led to models that tackle diverse molecular tasks, such as chemical reaction prediction and molecular property prediction. Large-scale molecular instruction-tuning datasets have enabled sequence-only (e.g., SMILES or SELFIES) generalist molecular LLMs, and researchers are now exploring multimodal approaches that incorporate molecular structural information for further gains. However, a genuinely multimodal, generalist LLM that covers a broad spectrum of molecular tasks has yet to be fully investigated. We observe that naive next token prediction training ignores graph-structural information, limiting an LLM's ability to exploit molecular graphs. To address this, we propose (i) Molecular structure Preference Optimization (MolPO), which facilitates graph usage by optimizing preferences between pairs of correct and perturbed molecular structures, and (ii) an advanced graph encoder with a tailored pre-training strategy to improve the effect of graph utilization by MolPO. Building on these contributions, we introduce Mol-LLM, the first multimodal generalist model that (a) handles a broad spectrum of molecular tasks among molecular LLMs, (b) explicitly leverages molecular-structure information, and (c) takes advantage of extensive instruction tuning. Mol-LLM attains state-of-the-art or comparable results across the most comprehensive molecular-LLM benchmark, even on out-of-distribution datasets for reaction and property prediction, where it surpasses prior generalist molecular LLMs by a large margin.

摘要

大语言模型（LLM）的最新进展催生了能够处理多种分子任务的模型，如化学反应预测和分子性质预测。大规模分子指令调优数据集使得仅基于序列（如SMILES或SELFIES）的通用分子LLM成为可能，研究者们正探索融入分子结构信息的多模态方法以进一步提升性能。然而，真正覆盖广泛分子任务的多模态通用LLM尚未得到充分研究。我们发现，简单的下一词元预测训练会忽略图结构信息，限制了LLM利用分子图的能力。为此，我们提出：（i）分子结构偏好优化（MolPO），通过优化正确与扰动分子结构对之间的偏好来促进图结构利用；（ii）采用先进图编码器及定制预训练策略以增强MolPO的图结构利用效果。基于这些贡献，我们推出首个多模态通用模型Mol-LLM，其特点包括：（a）在分子LLM中覆盖最广泛的分子任务；（b）显式利用分子结构信息；（c）充分利用大规模指令调优。Mol-LLM在目前最全面的分子LLM基准测试中取得最优或可比结果——即使在反应和性质预测的分布外数据集上，其表现也显著超越现有通用分子LLM。

CMoE: Converting Mixture-of-Experts from Dense to Accelerate LLM Inference

Abstract

arXiv:2502.04416v2 Announce Type: replace-cross Abstract: Scaling large language models (LLMs) improves performance but dramatically increases inference costs. The feed-forward network (FFN), consuming approximately 70% of inference compute, represents a critical bottleneck, particularly in large batch size scenarios. While mixture-of-experts (MoE) architectures leverage activation sparsity for efficiency, converting existing dense models to MoEs traditionally requires resource-intensive continual pre-training. We present CMoE, a framework that rapidly transforms dense LLMs into MoEs without training. The key innovation lies in analyzing FFN neuron activations to partition them into shared (always active) and routed experts. Routed neurons are clustered using a balanced assignment algorithm, and a differentiable router is constructed analytically from activation statistics, enabling immediate deployment or optional lightweight fine-tuning. Experiments demonstrate that, with activation ratio of 75%, it achieves remarkable results, delivering lossless precision in terms of perplexity while still maintaining a 5% acceleration. Further experiments reveal that a CMoE configuration activating just 25% of parameters reduces end-to-end latency by 1.5x while preserving usable perplexity without additional training. Moreover, a brief LoRA fine-tuning process (requiring only 1 hour and 2,000 samples) successfully recovers over 76% of the dense model's downstream accuracy. By effectively balancing performance and efficiency, CMoE offers a viable path forward for deploying LLMs in real-world scenarios where computational resources are limited. We make our code publicly available at https://github.com/JarvisPei/CMoE.

摘要

大规模语言模型（LLMs）的扩展虽能提升性能，却显著增加了推理成本。前馈网络（FFN）消耗约70%的推理计算资源，成为关键瓶颈，尤其在大批量场景下。尽管混合专家（MoE）架构利用激活稀疏性提升效率，但将现有稠密模型转换为MoE传统上需要资源密集的持续预训练。本文提出CMoE框架，无需训练即可快速将稠密LLMs转化为MoE。其核心创新在于通过分析FFN神经元激活模式，将其划分为共享（常激活）和路由专家两部分。路由神经元采用平衡分配算法聚类，并基于激活统计量解析构建可微分路由器，支持即时部署或可选轻量微调。实验表明，在75%激活率条件下，该方法实现了困惑度无损的精确度，同时保持5%的加速效果。进一步实验显示，仅激活25%参数的CMoE配置可将端到端延迟降低1.5倍，且无需额外训练即可维持可用困惑度。此外，通过简短LoRA微调（仅需1小时和2000样本）可恢复稠密模型76%以上的下游任务准确率。CMoE通过有效平衡性能与效率，为计算资源受限的实际场景部署LLMs提供了可行方案。代码已开源：https://github.com/JarvisPei/CMoE。

SelfElicit: Your Language Model Secretly Knows Where is the Relevant Evidence

Abstract

arXiv:2502.08767v2 Announce Type: replace-cross Abstract: Providing Language Models (LMs) with relevant evidence in the context (either via retrieval or user-provided) can significantly improve their ability to provide better-grounded responses. However, recent studies have found that LMs often struggle to fully comprehend and utilize key evidence from the context, especially when it contains noise and irrelevant information, an issue common in real-world scenarios. To address this, we propose SelfElicit, an inference-time approach that helps LMs focus on key contextual evidence through self-guided explicit highlighting. By leveraging the inherent evidence-finding capabilities of LMs using the attention scores of deeper layers, our method automatically identifies and emphasizes key evidence within the input context, facilitating more accurate and grounded responses without additional training or iterative prompting. We demonstrate that SelfElicit brings consistent and significant improvement on multiple evidence-based QA tasks for various LM families while maintaining computational efficiency. Our code and documentation are available at https://github.com/ZhiningLiu1998/SelfElicit.

摘要

为语言模型（LMs）在上下文中提供相关证据（通过检索或用户提供）可以显著提高其生成更具依据性回答的能力。然而，近期研究发现，语言模型往往难以充分理解并利用上下文中的关键证据，尤其是在包含噪声和无关信息的场景中，这一问题在现实应用中十分常见。为此，我们提出SelfElicit，一种推理时自引导显式高亮方法，通过利用语言模型深层注意力分数所固有的证据发现能力，自动识别并强调输入上下文中的关键证据，从而无需额外训练或迭代提示即可促进更准确且基于证据的响应。实验表明，SelfElicit在多种基于证据的问答任务上为不同系列语言模型带来了一致且显著的性能提升，同时保持了计算效率。我们的代码及文档详见https://github.com/ZhiningLiu1998/SelfElicit。

Aligning Large Language Models to Follow Instructions and Hallucinate Less via Effective Data Filtering

Abstract

arXiv:2502.07340v3 Announce Type: replace-cross Abstract: Training LLMs on data containing unfamiliar knowledge during the instruction tuning stage can encourage hallucinations. To address this challenge, we introduce NOVA, a novel framework designed to identify high-quality data that aligns well with the LLM's learned knowledge to reduce hallucinations. NOVA includes Internal Consistency Probing (ICP) and Semantic Equivalence Identification (SEI) to measure how familiar the LLM is with instruction data. Specifically, ICP evaluates the LLM's understanding of the given instruction by calculating the tailored consistency among multiple self-generated responses. SEI further assesses the familiarity of the LLM with the target response by comparing it to the generated responses, using the proposed semantic clustering and well-designed voting strategy. Finally, to ensure the quality of selected samples, we introduce an expert-aligned reward model, considering characteristics beyond just familiarity. By considering data quality and avoiding unfamiliar data, we can utilize the selected data to effectively align LLMs to follow instructions and hallucinate less.

摘要

在指令微调阶段对包含陌生知识的训练数据进行大型语言模型(LLM)训练可能诱发幻觉现象。为解决这一挑战，我们提出NOVA框架，该创新系统通过识别与LLM已掌握知识高度契合的高质量数据来减少幻觉。NOVA包含内部一致性探测(ICP)和语义等价识别(SEI)两大模块，用于量化LLM对指令数据的熟悉程度。具体而言，ICP通过计算模型多个自生成响应之间的定制化一致性，评估LLM对给定指令的理解深度；SEI则采用语义聚类技术和精心设计的投票策略，通过对比目标响应与生成响应来进一步评估LLM对预期输出的熟悉度。为确保筛选样本的质量，我们还引入专家对齐奖励模型，该模型综合考虑了熟悉度之外的多种特征。通过关注数据质量并规避陌生数据，我们能够利用筛选出的数据有效对齐LLM的指令跟随能力，显著降低幻觉发生率。

Provably Overwhelming Transformer Models with Designed Inputs

Abstract

arXiv:2502.06038v2 Announce Type: replace-cross Abstract: We develop an algorithm which, given a trained transformer model $\mathcal{M}$ as input, as well as a string of tokens $s$ of length $n_{fix}$ and an integer $n_{free}$, can generate a mathematical proof that $\mathcal{M}$ is "overwhelmed" by $s$, in time and space $\widetilde{O}(n_{fix}^2 + n_{free}^3)$. We say that $\mathcal{M}$ is "overwhelmed" by $s$ when the output of the model evaluated on this string plus any additional string $t$, $\mathcal{M}(s + t)$, is completely insensitive to the value of the string $t$ whenever $\mathrm{length}(t) \leq n_{free}$. Along the way, we prove a particularly strong worst-case form of "over-squashing", which we use to bound the model's behavior. Our technique uses computer-aided proofs to establish this type of operationally relevant guarantee about transformer models. We empirically test our algorithm on a single layer transformer complete with an attention head, layer-norm, MLP/ReLU layers, and RoPE positional encoding. We believe that this work is a stepping stone towards the difficult task of obtaining useful guarantees for trained transformer models.

摘要

我们开发了一种算法，该算法以训练好的Transformer模型 $\mathcal{M}$ 作为输入，同时接收长度为 $n_{fix}$ 的标记字符串 $s$ 和整数 $n_{free}$ ，能够在 $\widetilde{O}(n_{fix}^2 + n_{free}^3)$ 的时间和空间复杂度内生成数学证明，证实 $\mathcal{M}$ 被 $s$ '完全压制'。当模型在该字符串加上任意附加字符串 $t$ （即 $\mathcal{M}(s + t)$ ）的输出对 $t$ 的值完全无反应（只要 $t$ 的长度 $\leq n_{free}$ ）时，我们称 $\mathcal{M}$ 被 $s$ '完全压制'。在此过程中，我们证明了一种特别强的'过度压缩'最坏情况形式，并利用该结论来界定模型行为。我们的技术采用计算机辅助证明来建立这类与Transformer模型操作相关的保证。我们在包含注意力头、层归一化、MLP/ReLU层和RoPE位置编码的单层Transformer上对算法进行了实证测试。本研究为获得训练后Transformer模型的有用保证这一艰巨任务奠定了重要基础。
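Read literally, the definition of "overwhelmed" given in this abstract can be written as a single condition. The token alphabet $\Sigma$ and the quantification over pairs $t, t'$ are notational assumptions introduced here; the paper's own formal statement may differ in detail.

$$
\mathcal{M} \text{ is overwhelmed by } s
\iff
\forall\, t, t' \in \Sigma^{*} \text{ with } \mathrm{length}(t), \mathrm{length}(t') \leq n_{free}:\quad
\mathcal{M}(s + t) = \mathcal{M}(s + t').
$$

Under this reading, a proof of overwhelming certifies that no suffix of up to $n_{free}$ tokens can change the model's output once the fixed prefix $s$ is in place.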

Diffusion Instruction Tuning

Abstract

arXiv:2502.06814v2 Announce Type: replace-cross Abstract: We introduce Lavender, a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion. Specifically, Lavender aligns the text-vision attention in the VLM transformer with the equivalent used by Stable Diffusion during SFT, instead of adapting separate encoders. This alignment enriches the model's visual understanding and significantly boosts performance across in- and out-of-distribution tasks. Lavender requires just 0.13 million training examples, 2.5% of typical large-scale SFT datasets, and fine-tunes on standard hardware (8 GPUs) in a single day. It consistently improves state-of-the-art open-source multimodal LLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5), achieving up to 30% gains and a 68% boost on challenging out-of-distribution medical QA tasks. By efficiently transferring the visual expertise of image generators with minimal supervision, Lavender offers a scalable solution for more accurate vision-language systems. All code, training data, and models will be shared at https://astrazeneca.github.io/vlm/.

摘要

我们提出Lavender，这是一种简单的监督微调（SFT）方法，通过利用最先进的图像生成模型（如Stable Diffusion）来提升先进视觉语言模型（VLM）的性能。具体而言，Lavender在SFT过程中将VLM Transformer中的文本-视觉注意力机制与Stable Diffusion所使用的对应机制对齐，而非调整独立的编码器。这种对齐丰富了模型的视觉理解能力，并显著提升了分布内和分布外任务的性能。Lavender仅需13万个训练样本（相当于典型大规模SFT数据集的2.5%），并在标准硬件（8块GPU）上一天内完成微调。该方法持续改进了当前最先进的开源多模态大语言模型（如Llama-3.2-11B、MiniCPM-Llama3-v2.5），在具有挑战性的分布外医学问答任务中实现了最高30%的性能提升和68%的显著进步。通过以最小监督高效迁移图像生成器的视觉专业知识，Lavender为构建更精准的视觉语言系统提供了可扩展的解决方案。所有代码、训练数据和模型将在https://astrazeneca.github.io/vlm/共享。

DECT: Harnessing LLM-assisted Fine-Grained Linguistic Knowledge and Label-Switched and Label-Preserved Data Generation for Diagnosis of Alzheimer's Disease

Abstract

arXiv:2502.04394v2 Announce Type: replace-cross Abstract: Alzheimer's Disease (AD) is an irreversible neurodegenerative disease affecting 50 million people worldwide. Low-cost, accurate identification of key markers of AD is crucial for timely diagnosis and intervention. Language impairment is one of the earliest signs of cognitive decline and can be used to discriminate AD patients from normal control individuals. Patient-interviewer dialogues may be used to detect such impairments, but they are often mixed with ambiguous, noisy, and irrelevant information, making the AD detection task difficult. Moreover, the limited availability of AD speech samples and the variability in their speech styles pose significant challenges in developing robust speech-based AD detection models. To address these challenges, we propose DECT, a novel speech-based domain-specific approach leveraging large language models (LLMs) for fine-grained linguistic analysis and label-switched and label-preserved data generation. Our study presents four novelties: (1) We harness the summarizing capabilities of LLMs to identify and distill key Cognitive-Linguistic information from noisy speech transcripts, effectively filtering irrelevant information. (2) We leverage the inherent linguistic knowledge of LLMs to extract linguistic markers from unstructured and heterogeneous audio transcripts. (3) We exploit the compositional ability of LLMs to generate AD speech transcripts consisting of diverse linguistic patterns to overcome the speech data scarcity challenge and enhance the robustness of AD detection models. (4) We use the augmented AD textual speech transcript dataset and a more fine-grained representation of AD textual speech transcript data to fine-tune the AD detection model. The results show that DECT outperforms the baselines, with an 11% improvement in AD detection accuracy on the datasets from DementiaBank.

摘要

阿尔茨海默病（AD）是一种不可逆的神经退行性疾病，全球影响5000万人口。低成本、精准识别AD关键生物标志物对及时诊断和干预至关重要。语言障碍是认知衰退最早期的症状之一，可用于区分AD患者与正常对照组。医患对话可用于检测此类障碍，但常混杂模糊、嘈杂及无关信息，导致AD检测任务困难。此外，AD语音样本的有限性及其言语风格的差异性，对开发鲁棒的语音基AD检测模型构成重大挑战。针对这些问题，我们提出DECT——一种基于语音的领域特异性新方法，利用大语言模型（LLMs）进行细粒度语言分析和标签转换-标签保留的数据生成。本研究呈现四项创新：1）运用LLMs的总结能力从含噪语音文本中识别并提炼关键认知-语言信息，有效过滤无关内容；2）利用LLMs固有语言知识从非结构化异质音频文本中提取语言标记；3）发挥LLMs的组合生成能力构建包含多样化语言模式的AD语音文本，以克服语音数据稀缺问题并增强AD检测模型鲁棒性；4）使用增广的AD文本语音数据集及更细粒度的AD文本语音表征对检测模型进行微调。实验结果表明，在DementiaBank数据集上，DECT相较基线模型展现出11%的AD检测准确率提升，具有显著性能优势。

MELON: Provable Indirect Prompt Injection Defense via Masked Re-execution and Tool Comparison

Abstract

arXiv:2502.05174v3 Announce Type: replace-cross Abstract: Recent research has shown that LLM agents are vulnerable to indirect prompt injection (IPI) attacks, where malicious tasks embedded in tool-retrieved information can redirect the agent to take unauthorized actions. Existing defenses against IPI have significant limitations: they either require substantial model training resources, lack effectiveness against sophisticated attacks, or harm normal utility. We present MELON (Masked re-Execution and TooL comparisON), a novel IPI defense. Our approach builds on the observation that under a successful attack, the agent's next action becomes less dependent on user tasks and more on malicious tasks. Following this, we design MELON to detect attacks by re-executing the agent's trajectory with a masked user prompt modified through a masking function. We identify an attack if the actions generated in the original and masked executions are similar. We also include three key designs to reduce the potential false positives and false negatives. Extensive evaluation on the IPI benchmark AgentDojo demonstrates that MELON outperforms SOTA defenses in both attack prevention and utility preservation. Moreover, we show that combining MELON with a SOTA prompt augmentation defense (denoted as MELON-Aug) further improves its performance. We also conduct a detailed ablation study to validate our key designs. Code is available at https://github.com/kaijiezhu11/MELON.

摘要

近期研究表明，大型语言模型（LLM）智能体易受间接提示注入（IPI）攻击的影响，此类攻击通过工具检索信息中嵌入的恶意任务，诱导智能体执行未授权操作。现有IPI防御方案存在显著局限：或需消耗大量模型训练资源，或对复杂攻击有效性不足，或损害正常功能。本文提出MELON（掩码重执行与工具对比），一种新型IPI防御方法。我们的方案基于关键发现：在成功攻击下，智能体的后续动作对用户任务的依赖性降低，而对恶意任务的依赖性增强。基于此，我们设计MELON通过掩码函数修改用户提示后重执行智能体轨迹来检测攻击——若原始执行与掩码执行生成的动作相似则判定为攻击。方案包含三项核心设计以降低误报与漏报风险。在IPI基准测试AgentDojo上的大量实验表明，MELON在攻击防御与功能保持方面均优于当前最优（SOTA）方案。进一步研究表明，将MELON与SOTA提示增强防御方案结合（记为MELON-Aug）可进一步提升性能。我们还通过详细消融实验验证了核心设计的有效性。代码已开源：https://github.com/kaijiezhu11/MELON。
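The detection rule stated in the MELON abstract (flag an attack when the tool calls of the original run and of the masked re-execution are similar) can be sketched as follows. The Jaccard overlap measure, the `threshold` value, and the `(tool name, arguments)` representation are assumptions made for illustration; the abstract does not specify how similarity is computed, and the three additional designs it mentions are not modelled here.

```python
from typing import List, Tuple

# A tool call is represented as (tool name, serialized arguments).
# This representation is a hypothetical choice for the sketch.
ToolCall = Tuple[str, str]


def looks_like_injection(
    original_calls: List[ToolCall],
    masked_calls: List[ToolCall],
    threshold: float = 0.5,
) -> bool:
    """Return True when both runs issue largely the same tool calls.

    If the agent still issues the same calls after the user prompt is masked,
    those calls were probably driven by tool-retrieved content, not the user.
    """
    original, masked = set(original_calls), set(masked_calls)
    if not original or not masked:
        return False  # nothing to compare, so no evidence of an attack
    overlap = len(original & masked) / len(original | masked)  # Jaccard index
    return overlap >= threshold
```

For example, if both runs call `("send_money", "to=attacker")`, the function flags the step; if the masked run issues an unrelated call, or none at all, it does not.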

QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language

Abstract

arXiv:2502.09723v3 Announce Type: replace-cross Abstract: Recent advances in large language models (LLMs) have demonstrated remarkable potential in the field of natural language processing. Unfortunately, LLMs face significant security and ethical risks. Although techniques such as safety alignment have been developed for defense, prior research reveals that such defenses can be bypassed through well-designed jailbreak attacks. In this paper, we propose QueryAttack, a novel framework to examine the generalizability of safety alignment. By treating LLMs as knowledge databases, we translate malicious queries in natural language into structured non-natural query language to bypass the safety alignment mechanisms of LLMs. We conduct extensive experiments on mainstream LLMs, and the results show that QueryAttack not only achieves high attack success rates (ASRs) but also jailbreaks various defense methods. Furthermore, we tailor a defense method against QueryAttack, which can reduce ASR by up to $64\%$ on GPT-4-1106. Our code is available at https://github.com/horizonsinzqs/QueryAttack.

摘要

大语言模型（LLMs）的最新进展在自然语言处理领域展现出显著潜力。然而，LLMs面临着重大的安全与伦理风险。尽管已开发出安全对齐等技术进行防御，先前研究表明通过精心设计的越狱攻击可能绕过此类防御。本文提出QueryAttack框架，用于检验安全对齐的泛化能力。通过将LLMs视为知识数据库，我们将自然语言中的恶意查询转换为结构化的非自然查询语言，从而绕过LLMs的安全对齐机制。我们在主流LLMs上进行了大量实验，结果表明QueryAttack不仅能实现高攻击成功率（ASRs），还能突破多种防御方法。此外，我们专门设计了一种针对QueryAttack的防御方法，可在GPT-4-1106上将ASR降低多达64%。代码详见https://github.com/horizonsinzqs/QueryAttack。

A Survey of LLM-based Agents in Medicine: How far are we from Baymax?

Abstract

arXiv:2502.11211v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are transforming healthcare through the development of LLM-based agents that can understand, reason about, and assist with medical tasks. This survey provides a comprehensive review of LLM-based agents in medicine, examining their architectures, applications, and challenges. We analyze the key components of medical agent systems, including system profiles, clinical planning mechanisms, medical reasoning frameworks, and external capacity enhancement. The survey covers major application scenarios such as clinical decision support, medical documentation, training simulations, and healthcare service optimization. We discuss evaluation frameworks and metrics used to assess these agents' performance in healthcare settings. While LLM-based agents show promise in enhancing healthcare delivery, several challenges remain, including hallucination management, multimodal integration, implementation barriers, and ethical considerations. The survey concludes by highlighting future research directions, including advances in medical reasoning inspired by recent developments in LLM architectures, integration with physical systems, and improvements in training simulations. This work provides researchers and practitioners with a structured overview of the current state and future prospects of LLM-based agents in medicine.

摘要

大型语言模型（LLM）正通过开发能够理解、推理并协助完成医疗任务的基于LLM的智能体，推动医疗健康领域的变革。本综述对医学领域基于LLM的智能体进行了全面回顾，系统考察了其架构体系、应用场景与现存挑战。我们分析了医疗智能体系统的关键组成部分，包括系统配置、临床规划机制、医学推理框架及外部能力增强模块。研究涵盖临床决策支持、医疗文书处理、培训模拟及医疗服务优化等主要应用场景，并探讨了用于评估这些智能体在医疗环境中性能的评价框架与指标体系。尽管基于LLM的智能体在提升医疗服务方面展现出潜力，但仍面临幻觉管理、多模态整合、实施障碍及伦理考量等诸多挑战。综述最后指明了未来研究方向，包括借鉴LLM架构最新进展的医学推理改进、与物理系统的整合以及培训模拟的优化。本研究为科研人员和从业者提供了关于医学领域基于LLM智能体发展现状与未来前景的结构化综述。

Balancing Truthfulness and Informativeness with Uncertainty-Aware Instruction Fine-Tuning

Abstract

arXiv:2502.11962v2 Announce Type: replace-cross Abstract: Instruction fine-tuning (IFT) can increase the informativeness of large language models (LLMs), but may reduce their truthfulness. This trade-off arises because IFT steers LLMs to generate responses containing long-tail knowledge that was not well covered during pre-training. As a result, models become more informative but less accurate when generalizing to unseen tasks. In this paper, we empirically demonstrate how unfamiliar knowledge in IFT datasets can negatively affect the truthfulness of LLMs, and we introduce two new IFT paradigms, $UNIT_{cut}$ and $UNIT_{ref}$ , to address this issue. $UNIT_{cut}$ identifies and removes unfamiliar knowledge from IFT datasets to mitigate its impact on model truthfulness, whereas $UNIT_{ref}$ trains LLMs to recognize their uncertainty and explicitly indicate it at the end of their responses. Our experiments show that $UNIT_{cut}$ substantially improves LLM truthfulness, while $UNIT_{ref}$ maintains high informativeness and reduces hallucinations by distinguishing between confident and uncertain statements.

摘要

指令微调（IFT）能够增强大语言模型（LLM）的信息性，但可能降低其真实性。这种权衡的出现是因为IFT引导LLM生成包含预训练阶段未充分覆盖的长尾知识的响应，导致模型在泛化至未见任务时虽更具信息性却准确性下降。本文通过实证研究揭示了IFT数据集中陌生知识如何对LLM真实性产生负面影响，并提出两种新型IFT范式—— $UNIT_{cut}$ 与 $UNIT_{ref}$ 以解决该问题。 $UNIT_{cut}$ 通过识别并剔除IFT数据集中的陌生知识来减轻其对模型真实性的影响，而 $UNIT_{ref}$ 则训练LLM识别自身不确定性并在响应末尾明确标注。实验表明， $UNIT_{cut}$ 显著提升了LLM真实性， $UNIT_{ref}$ 则通过区分确信与不确定陈述，在保持高信息性的同时有效减少了幻觉现象。

ReviewEval: An Evaluation Framework for AI-Generated Reviews

Abstract

arXiv:2502.11736v3 Announce Type: replace-cross Abstract: The escalating volume of academic research, coupled with a shortage of qualified reviewers, necessitates innovative approaches to peer review. In this work, we propose: 1. ReviewEval, a comprehensive evaluation framework for AI-generated reviews that measures alignment with human assessments, verifies factual accuracy, assesses analytical depth, and identifies the degree of constructiveness and adherence to reviewer guidelines; and 2. ReviewAgent, an LLM-based review generation agent featuring a novel alignment mechanism to tailor feedback to target conferences and journals, along with a self-refinement loop that iteratively optimizes its intermediate outputs and an external improvement loop using ReviewEval to improve upon the final reviews. ReviewAgent improves actionable insights by 6.78% and 47.62% over existing AI baselines and expert reviews respectively. Further, it boosts analytical depth by 3.97% and 12.73%, and enhances adherence to guidelines by 10.11% and 47.26% respectively. This paper establishes essential metrics for AI-based peer review and substantially enhances the reliability and impact of AI-generated reviews in academic research.

摘要

随着学术研究数量的激增与合格审稿人的短缺，同行评审亟需创新方法。本研究提出：1. ReviewEval：一个针对AI生成审稿意见的综合评估框架，用于衡量其与人类评估的一致性、验证事实准确性、评估分析深度、识别建设性程度及对审稿准则的遵循度；2. ReviewAgent：基于大语言模型的审稿生成智能体，其创新性对齐机制可针对目标会议/期刊定制反馈意见，通过自优化循环迭代改进中间输出，并利用ReviewEval构成外部改进循环以优化最终审稿意见。相比现有AI基准和专家审稿，ReviewAgent分别将可操作性建议提升6.78%和47.62%，分析深度提高3.97%和12.73%，对审稿准则的遵循度增强10.11%和47.26%。本研究为基于AI的同行评审建立了核心评估指标，显著提升了AI生成审稿意见在学术研究中的可靠性与影响力。

DiSCo: Device-Server Collaborative LLM-Based Text Streaming Services

Abstract

arXiv:2502.11417v2 Announce Type: replace-cross Abstract: The rapid rise of large language models (LLMs) in text streaming services has introduced significant cost and Quality of Experience (QoE) challenges in serving millions of daily requests, especially in meeting Time-To-First-Token (TTFT) and Time-Between-Token (TBT) requirements for real-time interactions. Our real-world measurements show that both server-based and on-device deployments struggle to meet diverse QoE demands: server deployments face high costs and last-hop issues (e.g., Internet latency and dynamics), while on-device LLM inference is constrained by resources. We introduce DiSCo, a device-server cooperative scheduler designed to optimize users' QoE by adaptively routing requests and migrating response generation between endpoints while maintaining cost constraints. DiSCo employs cost-aware scheduling, combining the predictable speed of on-device LLM inference with the flexible capacity of server-based inference to dispatch requests on the fly, and introduces a token-level migration mechanism to ensure consistent token delivery during migration. Evaluations on real-world workloads (including commercial services like OpenAI GPT and DeepSeek, and open-source deployments such as LLaMA3) show that DiSCo can improve users' QoE by reducing tail TTFT (11-52%) and mean TTFT (6-78%) across different model-device configurations. Its migration mechanism also dramatically reduces serving costs, by up to 84%, while maintaining comparable QoE levels.

摘要

大型语言模型（LLMs）在文本流服务中的迅速崛起，在为每日数百万请求提供服务时带来了显著的成本与体验质量（QoE）挑战，特别是在满足实时交互的首令牌时间（TTFT）和令牌间隔时间（TBT）需求方面。我们的实际测量表明，基于服务器和本地设备的部署均难以满足多样化的QoE需求：服务器部署面临高成本和最后一跳问题（如网络延迟与动态性），而本地设备上的LLM推理则受限于资源约束。

我们提出DiSCo——一种设备-服务器协同调度器，通过自适应路由请求并在终端间迁移响应生成来优化用户QoE，同时保持成本约束。DiSCo采用成本感知调度机制，结合本地设备LLM推理的可预测速度与服务器推理的弹性能力实现动态请求分配，并引入令牌级迁移机制以确保迁移过程中的令牌交付一致性。基于真实场景工作负载的评估（包括OpenAI GPT、DeepSeek等商业服务及LLaMA3等开源部署）显示：DiSCo能通过降低尾部TTFT（11-52%）和平均TTFT（6-78%）提升不同模型-设备配置下的用户QoE，同时借助迁移机制在保持相当QoE水平的前提下，将服务成本最高降低84%。

TokenSkip: Controllable Chain-of-Thought Compression in LLMs

Abstract

arXiv:2502.12067v2 Announce Type: replace-cross Abstract: Chain-of-Thought (CoT) has been proven effective in enhancing the reasoning capabilities of large language models (LLMs). Recent advancements, such as OpenAI's o1 and DeepSeek-R1, suggest that scaling up the length of CoT sequences during inference could further boost LLM reasoning performance. However, due to the autoregressive nature of LLM decoding, longer CoT outputs lead to a linear increase in inference latency, adversely affecting user experience, particularly when the CoT exceeds 10,000 tokens. To address this limitation, we analyze the semantic importance of tokens within CoT outputs and reveal that their contributions to reasoning vary. Building on this insight, we propose TokenSkip, a simple yet effective approach that enables LLMs to selectively skip less important tokens, allowing for controllable CoT compression. Extensive experiments across various models and tasks demonstrate the effectiveness of TokenSkip in reducing CoT token usage while preserving strong reasoning performance. Notably, when applied to Qwen2.5-14B-Instruct, TokenSkip reduces reasoning tokens by 40% (from 313 to 181) on GSM8K, with less than a 0.4% performance drop.

摘要

链式思考（CoT）已被证实能有效增强大语言模型（LLM）的推理能力。OpenAI的o1和DeepSeek-R1等最新研究表明，在推理过程中扩展CoT序列长度可进一步提升LLM的推理性能。然而，由于LLM解码的自回归特性，更长的CoT输出会导致推理延迟线性增加，尤其当CoT超过10,000个token时，会严重影响用户体验。为突破这一限制，我们分析了CoT输出中各个token的语义重要性，发现其对推理的贡献度存在显著差异。基于此发现，我们提出TokenSkip——一种简单高效的方法，使LLM能够选择性跳过次要token，实现可控的CoT压缩。跨多种模型和任务的实验表明，TokenSkip能在保持强劲推理性能的同时显著减少CoT的token消耗。值得注意的是，在Qwen2.5-14B-Instruct模型上应用TokenSkip时，GSM8K数据集的推理token数量减少40%（从313降至181），性能下降幅度不足0.4%。
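The idea of skipping less important chain-of-thought tokens at a controllable compression level, as described in the TokenSkip abstract, can be illustrated with a short sketch. It assumes that a per-token importance score is supplied by some external scorer (how TokenSkip actually measures importance is not stated in the abstract), and the function name and `ratio` parameter are hypothetical.

```python
import math
from typing import List


def compress_cot(tokens: List[str], importance: List[float], ratio: float) -> List[str]:
    """Keep the ceil(ratio * n) highest-scoring tokens, in their original order.

    `ratio` is the fraction of tokens to retain (0 < ratio <= 1).
    `importance` holds one externally computed score per token (an assumption).
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    if len(tokens) != len(importance):
        raise ValueError("need exactly one importance score per token")
    n_keep = max(1, math.ceil(ratio * len(tokens)))
    # Rank token positions by importance, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    # Restore the original order so the compressed chain still reads left to right.
    kept_positions = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept_positions]
```

Lowering `ratio` trades more aggressive compression against the risk of dropping tokens the reasoning depends on, which is the trade-off the abstract's GSM8K figures quantify.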

Conditioning LLMs to Generate Code-Switched Text

Abstract

arXiv:2502.12924v2 Announce Type: replace-cross Abstract: Code-switching (CS) is still a critical challenge in Natural Language Processing (NLP). Current Large Language Models (LLMs) struggle to interpret and generate code-switched text, primarily due to the scarcity of large-scale CS datasets for training. This paper presents a novel methodology to generate CS data using LLMs, and tests it on the English-Spanish language pair. We propose back-translating natural CS sentences into monolingual English, and using the resulting parallel corpus to fine-tune LLMs to turn monolingual sentences into CS. Unlike previous approaches to CS generation, our methodology uses natural CS data as a starting point, allowing models to learn its natural distribution beyond grammatical patterns. We thoroughly analyse the models' performance through a study on human preferences, a qualitative error analysis and an evaluation with popular automatic metrics. Results show that our methodology generates fluent code-switched text, expanding research opportunities in CS communication, and that traditional metrics do not correlate with human judgement when assessing the quality of the generated CS data. We release our code and generated dataset under a CC-BY-NC-SA license.

摘要

代码切换（CS）仍是自然语言处理（NLP）领域的关键挑战。当前大规模语言模型（LLMs）在理解和生成代码切换文本方面存在困难，主要源于缺乏大规模训练用CS数据集。本文提出一种利用LLMs生成CS数据的新方法，并以英语-西班牙语对进行测试。我们采用回译技术将自然CS句子转换为单语英语，进而利用生成的平行语料库对LLMs进行微调，使其能将单语句子转化为CS文本。与以往CS生成方法不同，本方法以自然CS数据为起点，使模型能够学习超越语法规则的自然分布特征。我们通过人类偏好研究、定性错误分析及主流自动指标评估，对模型性能进行全面分析。结果表明：该方法能生成流畅的代码切换文本，为CS通信研究开辟新途径；同时发现传统评估指标在衡量生成CS数据质量时与人类判断不相符。代码及生成数据集以CC-BY-NC-SA许可协议公开发布。

A Cognitive Writing Perspective for Constrained Long-Form Text Generation

Abstract

arXiv:2502.12568v3 Announce Type: replace-cross Abstract: Like humans, Large Language Models (LLMs) struggle to generate high-quality long-form text that adheres to strict requirements in a single pass. This challenge is unsurprising, as successful human writing, according to the Cognitive Writing Theory, is a complex cognitive process involving iterative planning, translating, reviewing, and monitoring. Motivated by these cognitive principles, we aim to equip LLMs with human-like cognitive writing capabilities through CogWriter, a novel training-free framework that transforms LLM constrained long-form text generation into a systematic cognitive writing paradigm. Our framework consists of two key modules: (1) a Planning Agent that performs hierarchical planning to decompose the task, and (2) multiple Generation Agents that execute these plans in parallel. The system maintains quality via continuous monitoring and reviewing mechanisms, which evaluate outputs against specified requirements and trigger necessary revisions. CogWriter demonstrates exceptional performance on LongGenBench, a benchmark for complex constrained long-form text generation. Even when using Qwen-2.5-14B as its backbone, CogWriter surpasses GPT-4o by 22% in complex instruction completion accuracy while reliably generating texts exceeding 10,000 words. We hope this cognitive science-inspired approach provides a paradigm for LLM writing advancements. CogWriter is available at https://github.com/KaiyangWan/CogWriter.

摘要

与人类类似，大型语言模型（LLM）难以在单次生成中创作出符合严格要求的高质量长文本。这一挑战并不令人意外，因为根据认知写作理论，成功的人类写作是一个复杂的认知过程，涉及迭代式的规划、转换、审查与监控。受这些认知原则启发，我们旨在通过CogWriter这一新型免训练框架，为LLM赋予类人的认知写作能力，将受限长文本生成转化为系统化的认知写作范式。该框架包含两个核心模块：（1）执行分层任务分解的规划代理；（2）并行执行生成任务的多个生成代理。系统通过持续的监控与审查机制维持质量，这些机制会评估输出是否符合指定要求并触发必要修订。CogWriter在长文本生成基准测试LongGenBench上展现出卓越性能：即使采用Qwen-2.5-14B作为基础模型，其复杂指令完成准确率仍超越GPT-4o达22%，并能稳定生成超1万字的文本。我们希望这种受认知科学启发的路径能为LLM写作技术发展提供新范式。

Training Turn-by-Turn Verifiers for Dialogue Tutoring Agents: The Curious Case of LLMs as Your Coding Tutors

Abstract

arXiv:2502.13311v3 Announce Type: replace-cross Abstract: Intelligent tutoring agents powered by large language models (LLMs) have been increasingly explored to deliver personalized knowledge in areas such as language learning and science education. However, their capabilities in guiding users to solve complex real-world tasks remain underexplored. To address this limitation, in this work, we focus on coding tutoring, a challenging problem that requires tutors to proactively guide students towards completing predefined coding tasks. We propose a novel agent workflow, Trace-and-Verify (TRAVER), which combines knowledge tracing to estimate a student's knowledge state and turn-by-turn verification to ensure effective guidance toward task completion. We introduce DICT, an automatic evaluation protocol that assesses tutor agents using controlled student simulation and code generation tests. Extensive experiments reveal the challenges of coding tutoring and demonstrate that TRAVER achieves a significantly higher success rate. Although we use code tutoring as an example in this paper, our approach can be extended beyond coding, providing valuable insights into advancing tutoring agents for human task learning.

摘要

基于大语言模型（LLM）的智能辅导代理在语言学习和科学教育等领域的个性化知识传授中日益受到关注。然而，这些代理在指导用户解决复杂现实任务方面的能力仍待深入探索。针对这一局限，本研究聚焦编程辅导这一具有挑战性的问题，该任务要求辅导者主动引导学生完成预定义的编程练习。我们提出了一种新型代理工作流程——追踪与验证（TRAVER），该方法结合知识追踪（用于评估学生知识状态）和逐轮验证（确保任务完成的引导有效性）。我们开发了DICT评估协议，通过受控学生模拟和代码生成测试对辅导代理进行自动化评估。大量实验揭示了编程辅导的难点，并证明TRAVER能显著提高任务成功率。尽管本文以编程辅导为例，但该方法可扩展至其他领域，为推进人类任务学习的辅导代理研究提供了重要启示。

A Tale of Two Structures: Do LLMs Capture the Fractal Complexity of Language?

Abstract

arXiv:2502.14924v2 Announce Type: replace-cross Abstract: Language exhibits a fractal structure in its information-theoretic complexity (i.e. bits per token), with self-similarity across scales and long-range dependence (LRD). In this work, we investigate whether large language models (LLMs) can replicate such fractal characteristics and identify conditions (such as temperature setting and prompting method) under which they may fail. Moreover, we find that the fractal parameters observed in natural language are contained within a narrow range, whereas those of LLMs' output vary widely, suggesting that fractal parameters might prove helpful in detecting a non-trivial portion of LLM-generated texts. Notably, these findings, and many others reported in this work, are robust to the choice of the architecture; e.g. Gemini 1.0 Pro, Mistral-7B and Gemma-2B. We also release a dataset comprising over 240,000 articles generated by various LLMs (both pretrained and instruction-tuned) with different decoding temperatures and prompting methods, along with their corresponding human-generated texts. We hope that this work highlights the complex interplay between fractal properties, prompting, and statistical mimicry in LLMs, offering insights for generating, evaluating and detecting synthetic texts.

摘要

语言在其信息论复杂度（即每标记比特数）中展现出分形结构，具有跨尺度的自相似性和长程依赖性（LRD）。本研究探讨了大语言模型（LLMs）能否复现此类分形特征，并识别可能导致其失效的条件（如温度设置和提示方法）。此外，我们发现自然语言中观察到的分形参数集中于狭窄范围内，而LLMs输出的分形参数则呈现较大波动，这表明分形参数可能有助于检测相当比例的LLM生成文本。值得注意的是，本研究的这些发现及其他多项结论对模型架构选择（如Gemini 1.0 Pro、Mistral-7B和Gemma-2B）具有鲁棒性。我们还发布了包含超过24万篇由不同LLM（包括预训练模型和指令调优模型）采用多种解码温度及提示方法生成的文章数据集，并附对应人类撰写文本。希望这项工作能揭示LLMs中分形特性、提示方法与统计模拟之间复杂的相互作用，为合成文本的生成、评估与检测提供新见解。

CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter

Abstract

arXiv:2502.16880v3 Announce Type: replace-cross Abstract: Speculative decoding is a powerful technique that accelerates Large Language Model (LLM) inference by leveraging a lightweight speculative draft model. However, existing designs suffer in performance due to misalignment between training and inference. Recent methods have tried to solve this issue by adopting a multi-step training strategy, but the complex inputs of different training steps make it harder for the draft model to converge. To address this, we propose CORAL, a novel framework that improves both accuracy and efficiency in speculative drafting. CORAL introduces Cross-Step Representation Alignment, a method that enhances consistency across multiple training steps, significantly improving speculative drafting performance. Additionally, we identify the LM head as a major bottleneck in the inference speed of the draft model. We introduce a weight-grouping mechanism that selectively activates a subset of LM head parameters during inference, substantially reducing the latency of the draft model. We evaluate CORAL on three LLM families and three benchmark datasets, achieving speedup ratios of 2.50x-4.07x, outperforming state-of-the-art methods such as EAGLE-2 and HASS. Our results demonstrate that CORAL effectively mitigates training-inference misalignment and delivers significant speedup for modern LLMs with large vocabularies.

摘要

推测解码是一种通过轻量级推测草稿模型加速大语言模型（LLM）推理的强大技术。然而，现有设计因训练与推理之间的不匹配而影响性能。近期方法尝试通过多步训练策略解决该问题，但不同训练步骤的复杂输入导致草稿模型更难收敛。为此，我们提出CORAL框架，显著提升推测草稿的准确性与效率。CORAL引入跨步表征对齐方法，增强多训练步骤间的一致性，从而大幅改善推测草稿性能。此外，我们发现LM头部是草稿模型推理速度的主要瓶颈，提出权重分组机制，在推理时选择性激活部分LM头部参数，显著降低延迟。我们在三个LLM系列和三个基准数据集上评估CORAL，实现2.50倍至4.07倍的加速比，优于EAGLE-2和HASS等先进方法。结果表明，CORAL有效缓解了训练-推理失配问题，并为大词汇量现代LLM带来显著加速。

Abstract

arXiv:2502.15361v2 Announce Type: replace-cross Abstract: Recent advances in large language models (LLMs) have enabled automatic generation of chain-of-thought (CoT) reasoning, leading to strong performance on tasks such as math and code. However, when reasoning steps reflect social stereotypes (e.g., those related to gender, race or age), they can reinforce harmful associations and lead to misleading conclusions. We present the first systematic evaluation of social bias within LLM-generated reasoning, using the BBQ dataset to analyze both prediction accuracy and bias. Our study spans a wide range of mainstream reasoning models, including instruction-tuned and CoT-augmented variants of DeepSeek-R1 (8B/32B), ChatGPT, and other open-source LLMs. We quantify how biased reasoning steps correlate with incorrect predictions and often lead to stereotype expression. To mitigate reasoning-induced bias, we propose Answer Distribution as Bias Proxy (ADBP), a lightweight mitigation method that detects bias by tracking how model predictions change across incremental reasoning steps. ADBP outperforms a stereotype-free baseline in most cases, mitigating bias and improving the accuracy of LLM outputs. Code will be released upon paper acceptance.

摘要

大语言模型（LLMs）的最新进展实现了自动生成思维链（CoT）推理，显著提升了数学和代码等任务的表现。然而，当推理步骤反映社会刻板印象（如涉及性别、种族或年龄的关联）时，可能强化有害偏见并导致误导性结论。我们首次对LLM生成推理中的社会偏见进行系统评估，使用BBQ数据集同时分析预测准确性与偏见程度。研究涵盖主流推理模型的广泛范围，包括DeepSeek-R1（8B/32B）、ChatGPT的指令微调与CoT增强变体及其他开源LLMs。我们量化了带有偏见的推理步骤如何与错误预测相关联，并频繁导致刻板印象表达。为缓解推理诱发的偏见，提出"答案分布作为偏见代理"（ADBP）的轻量级缓解方法，通过追踪模型预测在增量推理步骤中的变化来检测偏见。在多数情况下，ADBP表现优于无刻板印象基线，既能减轻偏见又可提升LLM输出的准确性。代码将在论文录用后公开。
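The abstract describes ADBP only at a high level: bias is detected by tracking how the model's predicted answer changes as reasoning steps are added. A rough sketch of that bookkeeping is given below. It assumes the model has already been queried after each incremental reasoning step and that the resulting answers are available as a list; the function names and the decision to report flip positions are illustrative assumptions and do not reproduce the authors' method.

```python
from collections import Counter
from typing import Dict, List


def answer_distribution(step_answers: List[str]) -> Dict[str, float]:
    """Fraction of incremental reasoning steps at which each answer was predicted."""
    if not step_answers:
        return {}
    counts = Counter(step_answers)
    total = len(step_answers)
    return {answer: count / total for answer, count in counts.items()}


def flip_points(step_answers: List[str]) -> List[int]:
    """Indices of steps where the predicted answer differs from the previous step.

    A flip marks a reasoning step worth inspecting, since it changed the prediction.
    """
    return [
        i
        for i in range(1, len(step_answers))
        if step_answers[i] != step_answers[i - 1]
    ]
```

In this simplified picture, a prediction that moves from "unknown" to a specific group after a particular step points to that step as a candidate source of stereotype-driven reasoning.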

Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch

Abstract

arXiv:2502.17173v3 Announce Type: replace-cross Abstract: Reward models (RMs) are crucial for aligning large language models (LLMs) with human preferences. However, most RM research is centered on English and relies heavily on synthetic resources, which leads to limited and less reliable datasets and benchmarks for Chinese. To address this gap, we introduce CheemsBench, a fully human-annotated RM evaluation benchmark within Chinese contexts, and CheemsPreference, a large-scale and diverse preference dataset annotated through human-machine collaboration to support Chinese RM training. We systematically evaluate open-source discriminative and generative RMs on CheemsBench and observe significant limitations in their ability to capture human preferences in Chinese scenarios. Additionally, based on CheemsPreference, we construct an RM that achieves state-of-the-art performance on CheemsBench, demonstrating the necessity of human supervision in RM training. Our findings reveal that scaled AI-generated data struggles to fully capture human preferences, emphasizing the importance of high-quality human supervision in RM development.

摘要

奖励模型（RMs）对于将大型语言模型（LLMs）与人类偏好对齐至关重要。然而，现有研究主要集中于英语领域且严重依赖合成资源，导致中文领域的可靠数据集与基准测试较为匮乏。为填补这一空白，我们提出了CheemsBench——一个完全由人工标注的中文语境RM评估基准，以及CheemsPreference——通过人机协作标注的大规模多样化偏好数据集，用于支持中文RM训练。我们在CheemsBench上系统评估了开源判别式与生成式RMs，发现这些模型在捕捉中文场景人类偏好方面存在显著局限。基于CheemsPreference构建的RM在CheemsBench上实现了最优性能，证实了人工监督在RM训练中的必要性。研究结果表明，单纯扩展AI生成数据难以完整捕捉人类偏好，这凸显了高质量人工监督在RM开发中的关键作用。

Can LLMs Help Uncover Insights about LLMs? A Large-Scale, Evolving Literature Analysis of Frontier LLMs

Abstract

arXiv:2502.18791v3 Announce Type: replace-cross Abstract: The surge of LLM studies makes synthesizing their findings challenging. Analysis of experimental results from literature can uncover important trends across studies, but the time-consuming nature of manual data extraction limits its use. Our study presents a semi-automated approach for literature analysis that accelerates data extraction using LLMs. It automatically identifies relevant arXiv papers, extracts experimental results and related attributes, and organizes them into a structured dataset, LLMEvalDB. We then conduct an automated literature analysis of frontier LLMs, reducing the effort of paper surveying and data extraction by more than 93% compared to manual approaches. We validate LLMEvalDB by showing that it reproduces key findings from a recent manual analysis of Chain-of-Thought (CoT) reasoning and also uncovers new insights that go beyond it, showing, for example, that in-context examples benefit coding & multimodal tasks but offer limited gains in math reasoning tasks compared to zero-shot CoT. Our automatically updatable dataset enables continuous tracking of target models by extracting evaluation studies as new data becomes available. Through LLMEvalDB and empirical analysis, we provide insights into LLMs while facilitating ongoing literature analyses of their behavior.

摘要

大语言模型（LLM）研究的激增使得综合其研究成果具有挑战性。对文献中实验结果的分析可以揭示跨研究的重要趋势，但人工数据提取的耗时性限制了其应用。本研究提出一种半自动化的文献分析方法，利用LLM加速数据提取。该方法能自动识别相关arXiv论文，提取实验结果及相关属性，并将其组织成结构化数据集LLMEvalDB。随后我们对前沿LLM进行自动化文献分析，相比人工方法减少了93%以上的论文调研和数据提取工作量。通过验证LLMEvalDB，我们证明其能复现近期人工分析思维链（CoT）推理的关键发现，并进一步揭示新见解，例如：上下文示例有益于编码和多模态任务，但在数学推理任务中与零样本CoT相比增益有限。我们的可自动更新数据集能通过提取新出现的评估研究，持续追踪目标模型。借助LLMEvalDB和实证分析，我们不仅深入理解LLM特性，还为持续开展其行为特征的文献分析提供了便利。

Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models

Abstract

arXiv:2503.01763v2 Announce Type: replace-cross Abstract: Tool learning aims to augment large language models (LLMs) with diverse tools, enabling them to act as agents for solving practical tasks. Due to the limited context length of tool-using LLMs, adopting information retrieval (IR) models to select useful tools from large toolsets is a critical initial step. However, the performance of IR models in tool retrieval tasks remains underexplored and unclear. Most tool-use benchmarks simplify this step by manually pre-annotating a small set of relevant tools for each task, which is far from real-world scenarios. In this paper, we propose ToolRet, a heterogeneous tool retrieval benchmark comprising 7.6k diverse retrieval tasks and a corpus of 43k tools, collected from existing datasets. We benchmark six types of models on ToolRet. Surprisingly, even models with strong performance on conventional IR benchmarks exhibit poor performance on ToolRet. This low retrieval quality degrades the task pass rate of tool-use LLMs. As a further step, we contribute a large-scale training dataset with over 200k instances, which substantially improves the tool retrieval ability of IR models.

摘要

工具学习旨在通过多样化工具增强大语言模型（LLMs），使其能够作为智能体解决实际任务。由于工具调用型LLMs的上下文长度有限，采用信息检索（IR）模型从大型工具集中筛选有用工具成为关键初始步骤。然而，IR模型在工具检索任务中的性能尚未得到充分研究和明确验证。现有工具使用基准大多通过人工预标注少量相关工具来简化这一步骤，这与真实场景相去甚远。本文提出ToolRet——一个包含7.6k个多样化检索任务的异构工具检索基准，以及从现有数据集中收集的43k个工具组成的语料库。我们对六类模型在ToolRet上进行基准测试。令人惊讶的是，即使在传统IR基准中表现优异的模型，在ToolRet上也表现欠佳。这种低检索质量会降低工具调用型LLMs的任务通过率。作为进一步贡献，我们提供了包含20万+实例的大规模训练数据集，显著优化了IR模型的工具检索能力。

Detecting LLM-Generated Korean Text through Linguistic Feature Analysis

Abstract

arXiv:2503.00032v3 Announce Type: replace-cross Abstract: The rapid advancement of large language models (LLMs) increases the difficulty of distinguishing between human-written and LLM-generated text. Detecting LLM-generated text is crucial for upholding academic integrity, preventing plagiarism, protecting copyrights, and ensuring ethical research practices. Most prior studies on detecting LLM-generated text focus primarily on English text. However, languages with distinct morphological and syntactic characteristics require specialized detection approaches. Their unique structures and usage patterns can hinder the direct application of methods primarily designed for English. Among such languages, we focus on Korean, which has relatively flexible spacing rules, a rich morphological system, and less frequent comma usage compared to English. We introduce KatFish, the first benchmark dataset for detecting LLM-generated Korean text. The dataset consists of text written by humans and generated by four LLMs across three genres. By examining spacing patterns, part-of-speech diversity, and comma usage, we illuminate the linguistic differences between human-written and LLM-generated Korean text. Building on these observations, we propose KatFishNet, a detection method specifically designed for the Korean language. KatFishNet achieves an average of 19.78% higher AUROC compared to the best-performing existing detection method. Our code and data are available at https://github.com/Shinwoo-Park/detecting_llm_generated_korean_text_through_linguistic_analysis.

摘要

大型语言模型(LLMs)的快速发展使得区分人类撰写文本与LLM生成文本的难度日益增加。检测LLM生成文本对于维护学术诚信、防止剽窃、保护版权以及确保研究伦理实践至关重要。现有大多数关于LLM生成文本检测的研究主要集中于英语文本。然而，具有独特形态和句法特征的语言需要专门的检测方法，其独特的结构和用法模式会阻碍主要为英语设计的方法直接应用。在此类语言中，我们聚焦于韩语——相比英语，韩语具有相对灵活的间距规则、丰富的形态系统以及更少使用逗号的特点。我们提出了首个用于检测LLM生成韩语文本的基准数据集KatFish，该数据集包含人类撰写文本及四种LLM在三种文体中生成的文本。通过分析间距模式、词性多样性和逗号使用情况，我们揭示了人类撰写与LLM生成韩语文本之间的语言学差异。基于这些发现，我们提出了专门针对韩语设计的检测方法KatFishNet。与现有最佳检测方法相比，KatFishNet实现了平均19.78%的AUROC提升。我们的代码和数据可在https://github.com/Shinwoo-Park/detecting_llm_generated_korean_text_through_linguistic_analysis获取。
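
Two of the surface cues named in the abstract, spacing patterns and comma usage, can be illustrated with a few lines of Python. This is only a minimal sketch: the function name `surface_features` and the per-character normalization are assumptions made for illustration, not the features KatFishNet actually uses, and part-of-speech diversity is left out because it would need a Korean morphological analyzer.

```python
def surface_features(text: str) -> dict:
    """Compute two illustrative surface statistics for a piece of text.

    Both rates are normalized by the total character count, which is an
    assumption of this sketch rather than a definition from the paper.
    """
    length = len(text)
    if length == 0:
        return {"comma_rate": 0.0, "space_rate": 0.0}
    return {
        "comma_rate": text.count(",") / length,  # commas per character
        "space_rate": text.count(" ") / length,  # spaces per character
    }


# Toy usage on a six-character string: one comma, two spaces.
example = surface_features("가, 나 다")
```

Statistics like these would normally be fed to a classifier alongside other features; on their own they only show what kind of signal the abstract is describing.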

Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions

Abstract

arXiv:2503.03862v2 Announce Type: replace-cross Abstract: Improvements in language model capabilities are often attributed to increasing model size or training data, but in some cases smaller models trained on curated data or with different architectural decisions can outperform larger ones trained on more tokens. What accounts for this? To quantify the impact of these design choices, we meta-analyze 92 open-source pretrained models across a wide array of scales, including state-of-the-art open-weights models as well as less performant models and those with less conventional design decisions. We find that by incorporating features besides model size and number of training tokens, we can achieve a relative 3-28% increase in ability to predict downstream performance compared with using scale alone. Analysis of model design decisions reveals insights into data composition, such as the trade-off between language and code tasks at 15-25% code, as well as the better performance of some architectural decisions such as choosing rotary over learned embeddings. Broadly, our framework lays a foundation for more systematic investigation of how model development choices shape final capabilities.

摘要

语言模型能力的提升通常归因于模型规模或训练数据的增加，但在某些情况下，经过数据筛选训练的小型模型或采用不同架构决策的模型，其表现可能优于使用更多标记训练的大型模型。如何解释这一现象？为量化这些设计选择的影响，我们对92个开源预训练模型进行了元分析，涵盖各种规模，包括最先进的开源权重模型、性能较弱的模型以及采用非常规设计决策的模型。研究发现，通过纳入除模型规模和训练标记数量之外的特征，与仅使用规模因素相比，我们预测下游任务性能的能力可相对提升3-28%。对模型设计决策的分析揭示了数据构成的深层规律，例如语言任务与代码任务在15-25%代码占比时的权衡关系，以及某些架构决策（如选择旋转嵌入而非学习嵌入）的优越性。总体而言，我们的研究框架为系统性探究模型开发选择如何影响最终能力奠定了基础。

LINGOLY-TOO: Disentangling Memorisation from Knowledge with Linguistic Templatisation and Orthographic Obfuscation

Abstract

arXiv:2503.02972v4 Announce Type: replace-cross Abstract: The expanding knowledge and memorisation capacity of frontier language models allows them to solve many reasoning tasks directly by exploiting prior knowledge, leading to inflated estimates of their reasoning abilities. We introduce LINGOLY-TOO, a challenging reasoning benchmark grounded in natural language and designed to counteract the effect of non-reasoning abilities on reasoning estimates. Using linguistically informed rulesets, we permute reasoning problems written in real languages to generate numerous question variations. These permutations preserve the intrinsic reasoning steps required for each solution while reducing the likelihood problems are directly solvable with models' knowledge. Experiments and analyses show that models can circumvent reasoning and answer from prior knowledge. On a metric that rewards consistent reasoning, all models perform poorly and exhibit high variance across question permutations, indicating that Large Language Models' (LLMs) reasoning faculty remains brittle. Overall, results on the benchmark reflect the recent progress of Inference-Time Compute (ITC) models but suggest ample room for further improvement. The benchmark is a step towards better measurement of reasoning abilities of LLMs and offers a cautionary tale on the importance of disentangling reasoning abilities from models' internalised knowledge when developing reasoning benchmarks.

摘要

前沿语言模型不断扩展的知识储备与记忆能力使其能够直接利用先验知识解决诸多推理任务，这导致对其推理能力的高估。我们提出LINGOLY-TOO这一基于自然语言的挑战性推理基准，旨在消除非推理因素对能力评估的影响。通过采用语言学规则集，我们对真实语言编写的推理问题进行排列组合，生成大量问题变体。这些变体在保留解题所需内在推理步骤的同时，降低了模型直接调用知识求解的可能性。实验与分析表明，模型能够绕过推理过程而依赖先验知识作答。在衡量推理一致性的指标上，所有模型均表现不佳且在不同问题变体间呈现高方差性，表明大语言模型（LLMs）的推理能力仍具脆弱性。总体而言，基准测试结果反映了推理时计算（ITC）模型的最新进展，但揭示出巨大改进空间。该基准朝着更准确评估LLMs推理能力迈出一步，并为开发推理基准时需区分推理能力与模型内化知识的重要性提供了警示案例。

PersonaX: A Recommendation Agent Oriented User Modeling Framework for Long Behavior Sequence

Abstract

arXiv:2503.02398v2 Announce Type: replace-cross Abstract: User profiles embedded in the prompt template of personalized recommendation agents play a crucial role in shaping their decision-making process. High-quality user profiles are essential for aligning agent behavior with real user interests. Typically, these profiles are constructed by leveraging LLMs for user profile modeling (LLM-UM). However, this process faces several challenges: (1) LLMs struggle with long user behaviors due to context length limitations and performance degradation. (2) Existing methods often extract only partial segments from the full historical behavior sequence, inevitably discarding diverse user interests embedded in the omitted content, leading to incomplete modeling and suboptimal profiling. (3) User profiling is often tightly coupled with the inference context, requiring online processing, which introduces significant latency overhead. In this paper, we propose PersonaX, an agent-agnostic LLM-UM framework to address these challenges. It augments downstream recommendation agents to achieve better recommendation performance and inference efficiency. PersonaX (a) segments complete historical behaviors into clustered groups, (b) selects multiple sub-behavior sequences (SBS) with a balance of prototypicality and diversity to form a high-quality core set, (c) performs offline multi-persona profiling to capture diverse user interests and generate fine-grained, cached textual personas, and (d) decouples user profiling from online inference, enabling profile retrieval instead of real-time generation. Extensive experiments demonstrate its effectiveness: using only 30 to 50% of behavioral data (sequence length 480), PersonaX enhances AgentCF by 3 to 11% and Agent4Rec by 10 to 50%. As a scalable and model-agnostic LLM-UM solution, PersonaX sets a new benchmark in scalable user modeling.

摘要

嵌入个性化推荐代理提示模板中的用户画像对其决策过程具有关键影响。高质量用户画像对于使代理行为与真实用户兴趣保持一致至关重要。典型方法是通过大语言模型进行用户画像建模（LLM-UM），但该过程面临三大挑战：（1）由于上下文长度限制和性能衰减，大语言模型难以处理长用户行为序列；（2）现有方法通常仅从完整历史行为序列中提取部分片段，不可避免地丢弃了被忽略内容中蕴含的多元用户兴趣，导致建模不完整和画像次优；（3）用户画像常与推理上下文紧密耦合，需在线处理从而引入显著延迟开销。本文提出PersonaX框架，这是一种与代理无关的LLM-UM解决方案，通过四项创新应对上述挑战：（a）将完整历史行为分割为聚类群组；（b）选取兼具原型代表性与多样性的多组子行为序列构成高质量核心集；（c）离线生成多维度画像以捕捉多元用户兴趣，形成细粒度、可缓存的文本化人物角色；（d）将用户画像与在线推理解耦，通过画像检索替代实时生成。大量实验证明其有效性：仅需30-50%行为数据（序列长度480），PersonaX使AgentCF提升3-11%、Agent4Rec提升10-50%。作为可扩展且模型无关的LLM-UM方案，PersonaX为可扩展用户建模设立了新基准。

InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models

Abstract

arXiv:2503.06692v3 Announce Type: replace-cross Abstract: Advanced reasoning in large language models has achieved remarkable performance on challenging tasks, but the prevailing long-context reasoning paradigm faces critical limitations: quadratic computational scaling with sequence length, reasoning constrained by maximum context boundaries, and performance degradation beyond pre-training context windows. Existing approaches primarily compress reasoning chains without addressing the fundamental scaling problem. To overcome these challenges, we introduce InftyThink, a paradigm that transforms monolithic reasoning into an iterative process with intermediate summarization. By interleaving short reasoning segments with concise progress summaries, our approach enables unbounded reasoning depth while maintaining bounded computational costs. This creates a characteristic sawtooth memory pattern that significantly reduces computational complexity compared to traditional approaches. Furthermore, we develop a methodology for reconstructing long-context reasoning datasets into our iterative format, transforming OpenR1-Math into 333K training instances. Experiments across multiple model architectures demonstrate that our approach reduces computational costs while improving performance, with Qwen2.5-Math-7B showing 3-13% improvements across MATH500, AIME24, and GPQA_diamond benchmarks. Our work challenges the assumed trade-off between reasoning depth and computational efficiency, providing a more scalable approach to complex reasoning without architectural modifications.

摘要

大语言模型中的高级推理在挑战性任务上已取得显著性能，但主流的长上下文推理范式存在关键局限：计算量随序列长度呈二次方增长、推理受限于最大上下文边界、以及超出预训练上下文窗口时的性能下降。现有方法主要压缩推理链而未解决根本的扩展问题。为克服这些挑战，我们提出InftyThink范式，将整体式推理转化为带中间摘要的迭代过程。通过在短推理段之间插入简明进度摘要，我们的方法在保持有限计算成本的同时实现无界推理深度，形成特有的锯齿状记忆模式，较传统方法显著降低计算复杂度。此外，我们开发了将长上下文推理数据集重构为迭代格式的方法，将OpenR1-Math转化为333K训练实例。多模型架构实验表明，该方法在降低计算成本的同时提升性能，Qwen2.5-Math-7B在MATH500、AIME24和GPQA_diamond基准上实现3-13%的改进。本研究挑战了推理深度与计算效率间的固有权衡，为无需架构修改的复杂推理提供了更具扩展性的解决方案。
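
The iterative scheme described in the InftyThink abstract, short reasoning segments interleaved with progress summaries, can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `iterative_reasoning`, `toy_reason`, and `toy_summarize` are hypothetical names, and the two toy functions stand in for the LLM calls a real system would make.

```python
from typing import Callable, Optional, Tuple

# Hypothetical roles: in a real system both would be LLM calls.
ReasonFn = Callable[[str, str], Tuple[str, Optional[str]]]  # (question, summary) -> (segment, answer or None)
SummarizeFn = Callable[[str, str], str]                     # (old summary, new segment) -> new summary


def iterative_reasoning(question: str, reason: ReasonFn, summarize: SummarizeFn,
                        max_iters: int = 16) -> Tuple[Optional[str], int]:
    """Run short reasoning segments, carrying only a summary between them.

    Returns the final answer (or None if max_iters is reached) and the largest
    context size seen, measured here in characters as a crude proxy.
    """
    summary = ""
    peak_context = 0
    for _ in range(max_iters):
        segment, answer = reason(question, summary)
        # Each step sees only question + summary + current segment, never the
        # full reasoning history, so the context stays bounded.
        peak_context = max(peak_context, len(question) + len(summary) + len(segment))
        if answer is not None:
            return answer, peak_context
        summary = summarize(summary, segment)
    return None, peak_context


def toy_reason(question: str, summary: str) -> Tuple[str, Optional[str]]:
    """Toy stand-in: add up a comma-separated list two numbers at a time."""
    numbers = [int(x) for x in question.split(",")]
    if summary:
        index, total = (int(x) for x in summary.split(";"))
    else:
        index, total = 0, 0
    chunk = numbers[index:index + 2]
    total += sum(chunk)
    index += len(chunk)
    segment = f"{index};{total}"
    answer = str(total) if index >= len(numbers) else None
    return segment, answer


def toy_summarize(previous_summary: str, segment: str) -> str:
    """Toy stand-in: the newest segment already encodes all progress so far."""
    return segment


answer, peak = iterative_reasoning("3,1,4,1,5,9", toy_reason, toy_summarize)
```

Because each iteration discards the previous segment and keeps only the summary, the context grows within a segment and drops at each summary, which is the sawtooth pattern the abstract mentions.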

General Table Question Answering via Answer-Formula Joint Generation

Abstract

arXiv:2503.12345v2 Announce Type: replace-cross Abstract: Advanced table question answering (TableQA) methods prompt large language models (LLMs) to generate answer text, SQL queries, Python code, or custom operations, which markedly improves complex reasoning in the TableQA task. However, these methods lack the versatility to cope with specific question types or table structures. In contrast, the Spreadsheet Formula, the widely used and well-defined operation language for tabular data, has not been thoroughly explored to solve TableQA. In this paper, we make a first attempt to use the Formula as the executable representation for solving complex reasoning on tables with different structures. Specifically, we construct FormulaQA, a large Formula-annotated TableQA dataset built from existing datasets. In addition, we propose TabAF, a general table answering framework to solve multiple types of tasks over multiple types of tables simultaneously. Unlike existing methods, TabAF decodes answers and Formulas with a single LLM backbone, demonstrating great versatility and generalization. TabAF based on Llama3.1-70B achieves new state-of-the-art performance on WikiTableQuestion, HiTab, and TabFact.

摘要

先进的表格问答（TableQA）方法通过提示大语言模型（LLM）生成答案文本、SQL查询、Python代码或自定义操作，显著提升了TableQA任务中的复杂推理能力。然而，这些方法缺乏应对特定问题类型或表格结构的通用性。相比之下，电子表格公式作为广泛使用且定义明确的表格数据操作语言，尚未在TableQA领域得到充分探索。本文首次尝试使用公式作为可执行表示方法，以解决不同结构表格上的复杂推理问题。具体而言，我们从现有数据集中构建了\texttt{FormulaQA}——一个大规模公式标注的TableQA数据集。此外，我们提出了\texttt{TabAF}框架，该通用表格应答框架能同时处理多种表格类型上的多类任务。与现有方法不同，\texttt{TabAF}通过单一LLM主干网络解码答案和公式，展现出卓越的通用性和泛化能力。基于Llama3.1-70B的\texttt{TabAF}在WikiTableQuestion、HiTab和TabFact数据集上实现了最先进的性能。

Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

Abstract

arXiv:2503.13377v2 Announce Type: replace-cross Abstract: Temporal Video Grounding (TVG), the task of locating specific video segments based on language queries, is a core challenge in long-form video understanding. While recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning (SFT), their abilities to generalize remain limited. To address this, we propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning (RL). Specifically, our contributions span three key directions: (1) Time-R1: we introduce a reasoning-guided post-training framework via RL with verifiable reward to enhance the capabilities of LVLMs on the TVG task. (2) TimeRFT: we explore data-efficient post-training strategies on our curated RL-friendly dataset, which trains the model to progressively comprehend difficult samples, leading to better generalization. (3) TVGBench: we carefully construct a small yet comprehensive benchmark for LVLM evaluation, assessing 11 types of queries and featuring balanced distributions across both videos and queries. Extensive experiments demonstrate that Time-R1 achieves state-of-the-art performance across multiple downstream datasets using only 2.5K training data, while improving its general video understanding capabilities.

摘要

时序视频定位（TVG）作为基于语言查询定位特定视频片段的任务，是长视频理解领域的核心挑战。尽管近期大规模视觉语言模型（LVLM）通过监督微调（SFT）在TVG任务中展现出初步潜力，但其泛化能力仍存在局限。为此，我们提出一种新颖的后训练框架，通过强化学习（RL）增强LVLM的泛化能力。具体贡献体现在三个关键方向：（1）Time-R1：提出具有可验证奖励机制的推理引导型RL后训练框架，显著提升LVLM在TVG任务中的性能；（2）TimeRFT：基于精选的RL友好型数据集探索高效数据后训练策略，使模型逐步理解困难样本从而实现更优泛化；（3）TVGBench：精心构建小型但全面的LVLM评估基准，涵盖11类查询类型并确保视频与查询的均衡分布。大量实验表明，Time-R1仅需2.5K训练数据即可在多个下游数据集实现最先进性能，同时提升通用视频理解能力。

One-Shot is Enough: Consolidating Multi-Turn Attacks into Efficient Single-Turn Prompts for LLMs

Abstract

arXiv:2503.04856v2 Announce Type: replace-cross Abstract: We introduce a novel framework for consolidating multi-turn adversarial "jailbreak" prompts into single-turn queries, significantly reducing the manual overhead required for adversarial testing of large language models (LLMs). While multi-turn human jailbreaks have been shown to yield high attack success rates, they demand considerable human effort and time. Our multi-turn-to-single-turn (M2S) methods -- Hyphenize, Numberize, and Pythonize -- systematically reformat multi-turn dialogues into structured single-turn prompts. Despite removing iterative back-and-forth interactions, these prompts preserve and often enhance adversarial potency: in extensive evaluations on the Multi-turn Human Jailbreak (MHJ) dataset, M2S methods achieve attack success rates from 70.6 percent to 95.9 percent across several state-of-the-art LLMs. Remarkably, the single-turn prompts outperform the original multi-turn attacks by as much as 17.5 percentage points while cutting token usage by more than half on average. Further analysis shows that embedding malicious requests in enumerated or code-like structures exploits "contextual blindness", bypassing both native guardrails and external input-output filters. By converting multi-turn conversations into concise single-turn prompts, the M2S framework provides a scalable tool for large-scale red teaming and reveals critical weaknesses in contemporary LLM defenses.

摘要

我们提出了一种创新框架，可将多轮对抗性"越狱"提示整合为单轮查询，显著降低大型语言模型（LLMs）对抗测试所需的人工成本。尽管多轮人工越狱已被证明能实现高攻击成功率，但其需要耗费大量人力和时间。我们开发的多轮转单轮（M2S）方法——连字符化、数字化和Python化——通过系统化重构多轮对话为结构化单轮提示。尽管移除了迭代式交互环节，这些提示不仅保留了对抗效力，还经常增强攻击效果：在Multi-turn Human Jailbreak（MHJ）数据集上的大规模评估表明，M2S方法对多个前沿LLMs的攻击成功率达到70.6%至95.9%。值得注意的是，单轮提示的表现较原始多轮攻击最高可提升17.5个百分点，同时平均减少超过一半的token消耗。进一步分析表明，将恶意请求嵌入枚举或类代码结构可有效利用"上下文盲区"，绕过模型原生防护机制和外部输入输出过滤器。通过将多轮对话转换为简洁的单轮提示，M2S框架为大规模红队测试提供了可扩展工具，同时揭示了当代LLM防御体系的关键弱点。

Abstract

arXiv:2504.07830v2 Announce Type: replace-cross Abstract: We present a novel, open-source social network simulation framework, MOSAIC, where generative language agents predict user behaviors such as liking, sharing, and flagging content. This simulation combines LLM agents with a directed social graph to analyze emergent deception behaviors and gain a better understanding of how users determine the veracity of online social content. By constructing user representations from diverse fine-grained personas, our system enables multi-agent simulations that model content dissemination and engagement dynamics at scale. Within this framework, we evaluate three different content moderation strategies with simulated misinformation dissemination, and we find that they not only mitigate the spread of non-factual content but also increase user engagement. In addition, we analyze the trajectories of popular content in our simulations, and explore whether simulation agents' articulated reasoning for their social interactions truly aligns with their collective engagement patterns. We open-source our simulation software to encourage further research within AI and social sciences.

摘要

我们提出了一种新颖的开源社交网络模拟框架MOSAIC，该框架利用生成式语言代理预测用户行为（如点赞、分享和标记内容）。该模拟将大语言模型代理与有向社交图谱相结合，用以分析涌现的欺骗行为，并更好地理解用户如何判定在线社交内容的真实性。通过构建基于多样化细粒度人物角色的用户表征，我们的系统支持多代理模拟，可大规模建模内容传播与参与动态。在此框架下，我们评估了三种针对模拟虚假信息传播的内容审核策略，发现这些策略不仅能减少非事实内容的扩散，还能提升用户参与度。此外，我们分析了模拟中热门内容的传播轨迹，并探究模拟代理对其社交互动所陈述的推理是否真实反映其集体参与模式。我们开源了模拟软件，以促进人工智能与社会科学领域的进一步研究。

Societal Impacts Research Requires Benchmarks for Creative Composition Tasks

Abstract

arXiv:2504.06549v2 Announce Type: replace-cross Abstract: Foundation models that are capable of automating cognitive tasks represent a pivotal technological shift, yet their societal implications remain unclear. These systems promise exciting advances, yet they also risk flooding our information ecosystem with formulaic, homogeneous, and potentially misleading synthetic content. Developing benchmarks grounded in real use cases where these risks are most significant is therefore critical. Through a thematic analysis of 2 million language model user prompts, we identify creative composition tasks as a prevalent usage category where users seek help with personal tasks that require everyday creativity. Our fine-grained analysis identifies mismatches between current benchmarks and usage patterns among these tasks. Crucially, we argue that the same use cases that currently lack thorough evaluations can lead to negative downstream impacts. This position paper argues that benchmarks focused on creative composition tasks are a necessary step towards understanding the societal harms of AI-generated content. We call for greater transparency in usage patterns to inform the development of new benchmarks that can effectively measure both the progress and the impacts of models with creative capabilities.

摘要

能够自动化认知任务的基础模型代表了一项关键性技术变革，但其社会影响仍不明确。这些系统虽有望带来激动人心的进步，却也存在用公式化、同质化且可能具有误导性的合成内容充斥我们信息生态系统的风险。因此，基于风险最突出的实际使用场景开发基准测试至关重要。通过对200万条语言模型用户提示进行主题分析，我们发现创意写作任务是一个普遍存在的使用类别——用户在此类涉及日常创造力的个人任务中寻求帮助。细粒度分析揭示了当前基准测试与这些任务实际使用模式之间的不匹配。关键的是，我们认为这些目前缺乏全面评估的使用场景可能导致负面的下游影响。本立场文件指出，聚焦创意写作任务的基准测试是理解AI生成内容社会危害的必要步骤。我们呼吁加强使用模式的透明度，以指导开发能够有效衡量创造性模型进展与影响的新基准。

DocAgent: A Multi-Agent System for Automated Code Documentation Generation

Abstract

arXiv:2504.08725v3 Announce Type: replace-cross Abstract: High-quality code documentation is crucial for software development, especially in the era of AI. However, generating it automatically using Large Language Models (LLMs) remains challenging, as existing approaches often produce incomplete, unhelpful, or factually incorrect outputs. We introduce DocAgent, a novel multi-agent collaborative system using topological code processing for incremental context building. Specialized agents (Reader, Searcher, Writer, Verifier, Orchestrator) then collaboratively generate documentation. We also propose a multi-faceted evaluation framework assessing Completeness, Helpfulness, and Truthfulness. Comprehensive experiments show that DocAgent consistently and significantly outperforms baselines. Our ablation study confirms the vital role of the topological processing order. DocAgent offers a robust approach for reliable code documentation generation in complex and proprietary repositories.

摘要

高质量代码文档对于软件开发至关重要，尤其在人工智能时代。然而，利用大语言模型（LLMs）自动生成文档仍存在挑战，现有方法常产生不完整、无帮助或事实错误的输出。我们提出DocAgent——一种基于拓扑代码处理实现增量上下文构建的新型多智能体协作系统。专业化智能体（阅读器、检索器、编写器、验证器、协调器）通过协作生成文档。我们还提出评估完整性、实用性和真实性的多维度评价框架。综合实验表明DocAgent持续显著优于基线方法。消融研究证实了拓扑处理顺序的关键作用。DocAgent为复杂专有代码库提供了可靠的文档生成解决方案。
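
The idea of a topological processing order can be illustrated with Python's standard-library `graphlib`: components are visited only after everything they depend on, so documentation for dependencies already exists when a dependent component is processed. This is a sketch of the general technique, not DocAgent's actual pipeline; the module names and the `documentation_order` helper are hypothetical, and the abstract does not say how dependency cycles are handled (here `graphlib` would raise `CycleError`).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def documentation_order(dependencies: dict) -> list:
    """Return components ordered so that dependencies come before dependents.

    `dependencies` maps each component name to the set of components it uses.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical three-module repository: app.main uses both other modules,
# and db.query uses utils.parse.
deps = {
    "app.main": {"utils.parse", "db.query"},
    "db.query": {"utils.parse"},
    "utils.parse": set(),
}
order = documentation_order(deps)
```

For this small graph the order is unique: `utils.parse` first, then `db.query`, then `app.main`. In larger graphs several valid orders may exist.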

Universal Item Tokenization for Transferable Generative Recommendation

Abstract

arXiv:2504.04405v3 Announce Type: replace-cross Abstract: Recently, generative recommendation has emerged as a promising paradigm, attracting significant research attention. The basic framework involves an item tokenizer, which represents each item as a sequence of codes serving as its identifier, and a generative recommender that predicts the next item by autoregressively generating the target item identifier. However, in existing methods, both the tokenizer and the recommender are typically domain-specific, limiting their ability for effective transfer or adaptation to new domains. To this end, we propose UTGRec, a Universal item Tokenization approach for transferable Generative Recommendation. Specifically, we design a universal item tokenizer for encoding rich item semantics by adapting a multimodal large language model (MLLM). By devising tree-structured codebooks, we discretize content representations into corresponding codes for item tokenization. To effectively learn the universal item tokenizer on multiple domains, we introduce two key techniques in our approach. For raw content reconstruction, we employ dual lightweight decoders to reconstruct item text and images from discrete representations to capture general knowledge embedded in the content. For collaborative knowledge integration, we assume that co-occurring items are similar and integrate collaborative signals through co-occurrence alignment and reconstruction. Finally, we present a joint learning framework to pre-train and adapt the transferable generative recommender across multiple domains. Extensive experiments on four public datasets demonstrate the superiority of UTGRec compared to both traditional and generative recommendation baselines.

摘要

近年来，生成式推荐作为一种有前景的范式崭露头角，吸引了大量研究关注。其基础框架包含项目标记器（将每个项目表示为标识符的代码序列）和生成式推荐器（通过自回归生成目标项目标识符来预测下一项目）。然而现有方法中，标记器与推荐器通常局限于特定领域，难以有效迁移或适配新领域。为此，我们提出UTGRec——一种面向可迁移生成式推荐的通用项目标记方法。具体而言，我们通过适配多模态大语言模型（MLLM）设计通用项目标记器，利用树状结构码本将内容表征离散化为对应代码以实现项目标记。为在多领域有效学习通用标记器，我们引入两项关键技术：在原始内容重建方面，采用双轻量解码器从离散表征重构项目文本与图像，以捕捉内容中的通用知识；在协同知识整合方面，基于共现项目相似的假设，通过共现对齐与重建整合协同信号。最后，我们提出联合学习框架，实现跨领域可迁移生成式推荐器的预训练与适配。在四个公开数据集上的大量实验表明，UTGRec相较于传统与生成式推荐基线均具有显著优势。

Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models

Abstract

arXiv:2504.05050v3 Announce Type: replace-cross Abstract: Large language models (LLMs) are foundational explorations to artificial general intelligence, yet their alignment with human values via instruction tuning and preference learning achieves only superficial compliance. Here, we demonstrate that harmful knowledge embedded during pretraining persists as indelible "dark patterns" in LLMs' parametric memory, evading alignment safeguards and resurfacing under adversarial inducement at distributional shifts. In this study, we first theoretically analyze the intrinsic ethical vulnerability of aligned LLMs by proving that current alignment methods yield only local "safety regions" in the knowledge manifold. In contrast, pretrained knowledge remains globally connected to harmful concepts via high-likelihood adversarial trajectories. Building on this theoretical insight, we empirically validate our findings by employing semantic coherence inducement under distributional shifts--a method that systematically bypasses alignment constraints through optimized adversarial prompts. This combined theoretical and empirical approach achieves a 100% attack success rate across 19 out of 23 state-of-the-art aligned LLMs, including DeepSeek-R1 and LLaMA-3, revealing their universal vulnerabilities.

摘要

大语言模型（LLMs）是通向人工通用智能的基础性探索，然而通过指令微调和偏好学习实现的人类价值观对齐仅达到表面合规。本研究表明，预训练阶段嵌入的有害知识会作为不可消除的"暗模式"持续存在于LLMs的参数化记忆中，规避对齐防护机制，并在分布偏移的对抗诱导下重新显现。本研究首先通过理论分析证明：当前对齐方法仅在知识流形上形成局部的"安全区域"，而预训练知识仍通过高似然对抗轨迹与有害概念保持全局连接，从而揭示了已对齐LLMs固有的伦理脆弱性。基于这一理论见解，我们通过分布偏移下的语义连贯性诱导方法——一种通过优化对抗提示系统绕过对齐约束的技术——对研究结论进行了实证验证。这种理论与实证相结合的方法在23个最先进的对齐LLMs（包括DeepSeek-R1和LLaMA-3）中的19个上实现了100%的攻击成功率，揭示了其普适性漏洞。

Mirror: Multimodal Cognitive Reframing Therapy for Rolling with Resistance

Abstract

arXiv:2504.13211v2 Announce Type: replace-cross Abstract: Recent studies have explored the use of large language models (LLMs) in psychotherapy; however, text-based cognitive behavioral therapy (CBT) models often struggle with client resistance, which can weaken therapeutic alliance. To address this, we propose a multimodal approach that incorporates nonverbal cues, which allows the AI therapist to better align its responses with the client's negative emotional state. Specifically, we introduce Mirror (Multimodal Interactive Rolling with Resistance), a novel synthetic dataset that pairs each client's statements with corresponding facial images. Using this dataset, we train baseline vision language models (VLMs) so that they can analyze facial cues, infer emotions, and generate empathetic responses to effectively manage client resistance. These models are then evaluated in terms of both their counseling skills as a therapist and the strength of therapeutic alliance in the presence of client resistance. Our results demonstrate that Mirror significantly enhances the AI therapist's ability to handle resistance, outperforming existing text-based CBT approaches. Human expert evaluations further confirm the effectiveness of our approach in managing client resistance and fostering therapeutic alliance.

摘要

近期研究探索了大型语言模型（LLMs）在心理治疗中的应用，但基于文本的认知行为疗法（CBT）模型常因来访者抗拒反应而削弱治疗联盟。为此，我们提出一种融合非语言线索的多模态方法，使AI治疗师能更好地根据来访者负面情绪状态调整回应。具体而言，我们构建了新型合成数据集Mirror（多模态交互式抗拒处理），该数据集将每位来访者的陈述与对应面部图像配对。基于此数据集，我们训练了基线视觉语言模型（VLMs），使其能够解析面部线索、推断情绪并生成共情回应，从而有效处理来访者抗拒。随后从治疗师咨询技能和治疗联盟强度两个维度，对这些模型在来访者抗拒情境下的表现进行评估。结果表明，Mirror显著提升了AI治疗师处理抗拒的能力，其表现优于现有文本型CBT方法。人类专家评估进一步证实了该方法在管理来访者抗拒和促进治疗联盟方面的有效性。

Empirical Evaluation of Knowledge Distillation from Transformers to Subquadratic Language Models

Abstract

arXiv:2504.14366v2 Announce Type: replace-cross Abstract: Knowledge distillation is a widely used technique for compressing large language models (LLMs), in which a smaller student model is trained to mimic a larger teacher model. Typically, both the teacher and student models are Transformer-based architectures, leveraging softmax attention for sequence modeling. However, the quadratic complexity of self-attention during inference remains a significant bottleneck, motivating the exploration of subquadratic alternatives such as structured state-space models (SSMs), linear attention, and recurrent architectures. In this work, we systematically evaluate the transferability of knowledge distillation from a Transformer teacher model to eight subquadratic student architectures. Our study investigates which subquadratic model can most effectively approximate the teacher model's learned representations through knowledge distillation, and how different architectural design choices influence the training dynamics. We further investigate the impact of initialization strategies, such as matrix mixing and query-key-value (QKV) copying, on the adaptation process. Our empirical results on multiple NLP benchmarks provide insights into the trade-offs between efficiency and performance, highlighting key factors for successful knowledge transfer to subquadratic architectures.

摘要

知识蒸馏是一种广泛应用于大型语言模型（LLM）压缩的技术，通过训练较小的学生模型来模仿较大的教师模型。通常情况下，教师模型和学生模型均采用基于Transformer的架构，利用softmax注意力进行序列建模。然而，推理过程中自注意力机制的二次复杂度仍是主要瓶颈，这促使研究者探索次二次替代方案，如结构化状态空间模型（SSM）、线性注意力及循环架构。本研究系统评估了从Transformer教师模型到八种次二次学生架构的知识蒸馏可迁移性。我们探究了哪种次二次模型能通过知识蒸馏最有效地逼近教师模型习得的表征，以及不同架构设计选择如何影响训练动态。进一步研究了矩阵混合和查询-键-值（QKV）复制等初始化策略对适应过程的影响。基于多个NLP基准的实证结果，我们揭示了效率与性能之间的权衡关系，并阐明了向次二次架构成功迁移知识的关键因素。
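
For readers unfamiliar with the training signal behind knowledge distillation, a common formulation is a temperature-softened KL divergence between teacher and student output distributions. The sketch below shows that textbook loss in plain Python. The abstract does not state which distillation objective the study uses, so this is an assumed, generic form; the function names and the default temperature are illustrative choices.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, with temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the usual rescaling that keeps gradient magnitudes
    comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# The loss is zero when the student matches the teacher and positive otherwise.
same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Because this loss depends only on output distributions, it applies whether the student is a Transformer or a subquadratic architecture, which is what makes cross-architecture distillation possible in principle.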

TextArena

Abstract

arXiv:2504.11442v2 Announce Type: replace-cross Abstract: TextArena is an open-source collection of competitive text-based games for training and evaluation of agentic behavior in Large Language Models (LLMs). It spans 57+ unique environments (including single-player, two-player, and multi-player setups) and allows for easy evaluation of model capabilities via an online-play system (against humans and other submitted models) with real-time TrueSkill scores. Traditional benchmarks rarely assess dynamic social skills such as negotiation, theory of mind, and deception, creating a gap that TextArena addresses. Designed with research, community, and extensibility in mind, TextArena emphasizes ease of adding new games, adapting the framework, testing models, playing against the models, and training models. Detailed documentation of environments, games, leaderboard, and examples is available at https://github.com/LeonGuertler/TextArena and https://www.textarena.ai/.

摘要

TextArena是一个开源的竞争性文本游戏集合，用于训练和评估大语言模型（LLMs）的代理行为。该平台涵盖57种以上独特环境（包括单人、双人和多人设置），并通过在线对战系统（支持人类玩家与其他提交模型对抗）实时计算TrueSkill评分，便于模型能力评估。传统基准测试鲜少涉及谈判、心理理论和欺骗等动态社交技能，TextArena正为此缺口而设计。该平台以研究、社区化和可扩展性为核心，着重实现以下功能的便捷性：添加新游戏、适配框架、测试模型、与模型对战以及训练模型。环境详情、游戏规则、排行榜及示例文档详见https://github.com/LeonGuertler/TextArena 与 https://www.textarena.ai/。

Robo-Troj: Attacking LLM-based Task Planners

Abstract

arXiv:2504.17070v2 Announce Type: replace-cross Abstract: Robots need task planning methods to achieve goals that require more than individual actions. Recently, large language models (LLMs) have demonstrated impressive performance in task planning. LLMs can generate a step-by-step solution using a description of actions and the goal. Despite the successes in LLM-based task planning, there is limited research studying the security aspects of those systems. In this paper, we develop Robo-Troj, the first multi-trigger backdoor attack for LLM-based task planners, which is the main contribution of this work. As a multi-trigger attack, Robo-Troj is trained to accommodate the diversity of robot application domains. For instance, one can use unique trigger words, e.g., "herical", to activate a specific malicious behavior, e.g., cutting hand on a kitchen robot. In addition, we develop an optimization method for selecting the trigger words that are most effective. Through demonstrating the vulnerability of LLM-based planners, we aim to promote the development of secured robot systems.

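Summary

Robots need task planning methods to achieve complex goals that go beyond single actions. Recently, large language models (LLMs) have shown strong performance in task planning, generating step-by-step solutions from a description of the actions and the goal. Despite this progress, research on the security of such systems remains limited. This paper presents Robo-Troj, the first multi-trigger backdoor attack on LLM-based task planners, which is the core contribution of the work. As a multi-trigger attack, Robo-Troj is trained to accommodate the diversity of robot application domains. For example, a specific trigger word (such as "herical") can activate a specific malicious behavior (such as a kitchen robot cutting a hand). In addition, we develop an optimization method for selecting the most effective trigger words. By exposing the vulnerability of LLM-based planners, this work aims to promote the development of secure robot systems.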

BackSlash: Rate Constrained Optimized Training of Large Language Models

Abstract

arXiv:2504.16968v3 Announce Type: replace-cross Abstract: The rapid advancement of large-language models (LLMs) has driven extensive research into parameter compression after training has been completed, yet compression during the training phase remains largely unexplored. In this work, we introduce Rate-Constrained Training (BackSlash), a novel training-time compression approach based on rate-distortion optimization (RDO). BackSlash enables a flexible trade-off between model accuracy and complexity, significantly reducing parameter redundancy while preserving performance. Experiments in various architectures and tasks demonstrate that BackSlash can reduce memory usage by 60% - 90% without accuracy loss and provides significant compression gain compared to compression after training. Moreover, BackSlash proves to be highly versatile: it enhances generalization with small Lagrange multipliers, improves model robustness to pruning (maintaining accuracy even at 80% pruning rates), and enables network simplification for accelerated inference on edge devices.

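Summary

The rapid progress of large language models (LLMs) has driven extensive research into parameter compression after training is complete, yet compression during the training phase remains largely unexplored. This work proposes Rate-Constrained Training (BackSlash), a new training-time compression method based on rate-distortion optimization (RDO) that allows a flexible trade-off between model accuracy and complexity, significantly reducing parameter redundancy while preserving performance. Experiments across multiple architectures and tasks show that BackSlash can reduce memory usage by 60% to 90% without accuracy loss and offers a significant advantage over post-training compression. The method is also highly versatile: it enhances generalization with small Lagrange multipliers, improves model robustness to pruning (maintaining accuracy even at an 80% pruning rate), and simplifies network structure to accelerate inference on edge devices.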
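
For orientation, rate-distortion optimization is conventionally written as a Lagrangian trade-off between a distortion term and a rate term. The form below is the generic textbook objective, not necessarily the exact formulation used in BackSlash; the symbols $\theta$ (model parameters), $D$ (distortion, for example the task loss), $R$ (rate, for example the cost of encoding the parameters), and $\lambda$ (Lagrange multiplier) are assumptions introduced here for illustration.

```latex
\min_{\theta} \; J(\theta) = D(\theta) + \lambda \, R(\theta), \qquad \lambda \ge 0
```

Under this reading, a larger $\lambda$ puts more weight on compression, while a small $\lambda$ acts as a mild penalty on the parameters, which fits the abstract's remark that small Lagrange multipliers enhance generalization.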

Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction

Abstract

arXiv:2504.15266v2 Announce Type: replace-cross Abstract: We design a suite of minimal algorithmic tasks that are a loose abstraction of open-ended real-world tasks. This allows us to cleanly and controllably quantify the creative limits of the present-day language model. Much like real-world tasks that require a creative, far-sighted leap of thought, our tasks require an implicit, open-ended stochastic planning step that either (a) discovers new connections in an abstract knowledge graph (like in wordplay, drawing analogies, or research) or (b) constructs new patterns (like in designing math problems or new proteins). In these tasks, we empirically and conceptually argue how next-token learning is myopic and memorizes excessively; multi-token approaches, namely teacherless training and diffusion models, comparatively excel in producing diverse and original output. Secondly, to elicit randomness without hurting coherence, we find that injecting noise at the input layer (dubbed as seed-conditioning) works surprisingly as well as (and in some conditions, better than) temperature sampling from the output layer. Thus, our work offers a principled, minimal test-bed for analyzing open-ended creative skills, and offers new arguments for going beyond next-token learning and temperature sampling. We make part of the code available under https://github.com/chenwu98/algorithmic-creativity

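Summary

We design a suite of minimal algorithmic tasks that loosely abstract open-ended real-world tasks. This design lets us cleanly and controllably quantify the creative limits of current language models. Like real-world tasks that demand a creative, far-sighted leap of thought, our tasks require the model to carry out an implicit, open-ended stochastic planning step that either (a) discovers new connections in an abstract knowledge graph (as in wordplay, drawing analogies, or research) or (b) constructs new patterns (as in designing math problems or new proteins). Through empirical and conceptual analysis, we argue that next-token learning is myopic on these tasks and relies excessively on memorization, whereas multi-token approaches (namely teacherless training and diffusion models) do better at producing diverse and original output. Second, to elicit randomness without hurting coherence, we find that injecting noise at the input layer (called seed-conditioning) works as well as temperature sampling at the output layer, and better under some conditions. This work thus provides a principled, minimal test bed for analyzing open-ended creative skills and offers new arguments for going beyond next-token learning and temperature sampling. Part of the code is available at https://github.com/chenwu98/algorithmic-creativity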
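
As a point of reference for the baseline the abstract compares against, the sketch below shows standard temperature sampling from output logits. It is a generic illustration, not code from the paper, and it does not implement seed-conditioning; the function names `softmax_with_temperature` and `sample_token` are hypothetical.

```python
import math
import random
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits: Sequence[float], temperature: float, rng: random.Random) -> int:
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0]  # made-up logits for a three-token vocabulary
    print(softmax_with_temperature(toy_logits, 0.5))  # sharper distribution
    print(softmax_with_temperature(toy_logits, 2.0))  # flatter distribution
    print(sample_token(toy_logits, 1.0, random.Random(0)))
```

Lower temperatures concentrate probability on the highest-scoring token, and higher temperatures flatten the distribution, which is the source of output randomness that the abstract contrasts with noise injected at the input.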

Intra-Layer Recurrence in Transformers for Language Modeling

Abstract

arXiv:2505.01855v2 Announce Type: replace-cross Abstract: Transformer models have established new benchmarks in natural language processing; however, their increasing depth results in substantial growth in parameter counts. While existing recurrent transformer methods address this issue by reprocessing layers multiple times, they often apply recurrence indiscriminately across entire blocks of layers. In this work, we investigate Intra-Layer Recurrence (ILR), a more targeted approach that applies recurrence selectively to individual layers within a single forward pass. Our experiments show that allocating more iterations to earlier layers yields optimal results. These findings suggest that ILR offers a promising direction for optimizing recurrent structures in transformer architectures.

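Summary

Transformer models have set new performance benchmarks in natural language processing, but their increasing depth leads to substantial growth in parameter counts. Existing recurrent transformer methods address this by reprocessing layers multiple times, but they typically apply recurrence indiscriminately across entire blocks of layers. This work investigates Intra-Layer Recurrence (ILR), a more targeted approach that selectively applies recurrence to individual layers within a single forward pass. Experiments show that allocating more iterations to earlier layers gives the best results. These findings suggest that ILR is a promising direction for optimizing recurrent structures in transformer architectures.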
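
The idea of per-layer recurrence can be sketched with toy layers. The example below is a minimal illustration under stated assumptions, not the paper's implementation: layers are plain functions on a float, and the names `forward_with_ilr` and `reuse_map` are hypothetical. Each entry of `reuse_map` says how many times the corresponding layer is applied within one forward pass.

```python
from typing import Callable, Sequence

Layer = Callable[[float], float]


def forward_with_ilr(x: float, layers: Sequence[Layer], reuse_map: Sequence[int]) -> float:
    """Run one forward pass, applying layer i reuse_map[i] times in a row."""
    if len(layers) != len(reuse_map):
        raise ValueError("reuse_map needs one entry per layer")
    for layer, repeats in zip(layers, reuse_map):
        for _ in range(repeats):
            x = layer(x)  # same weights reused, so no extra parameters
    return x


# Toy stand-ins for transformer layers.
toy_layers = [lambda v: v + 1.0, lambda v: v * 2.0, lambda v: v - 3.0]

if __name__ == "__main__":
    # No recurrence: every layer runs once.
    print(forward_with_ilr(1.0, toy_layers, [1, 1, 1]))
    # Extra iteration on the earliest layer only.
    print(forward_with_ilr(1.0, toy_layers, [2, 1, 1]))
```

A reuse map such as `[2, 1, 1]` spends the extra iteration on the earliest layer, the kind of allocation the abstract reports as working best, while the parameter count stays the same as in the non-recurrent pass.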

Accelerating Large Language Model Reasoning via Speculative Search

Abstract

arXiv:2505.02865v2 Announce Type: replace-cross Abstract: Tree-search-based reasoning methods have significantly enhanced the reasoning capability of large language models (LLMs) by facilitating the exploration of multiple intermediate reasoning steps, i.e., thoughts. However, these methods suffer from substantial inference latency, as they have to generate numerous reasoning thoughts, severely limiting LLM applicability. To address this challenge, we propose a novel Speculative Search (SpecSearch) framework that significantly accelerates LLM reasoning by optimizing thought generation. Specifically, SpecSearch utilizes a small model to strategically collaborate with a large model at both thought and token levels, efficiently generating high-quality reasoning thoughts. The major pillar of SpecSearch is a novel quality-preserving rejection mechanism, which effectively filters out thoughts whose quality falls below that of the large model's outputs. Moreover, we show that SpecSearch preserves comparable reasoning quality to the large model. Experiments on both the Qwen and Llama models demonstrate that SpecSearch significantly outperforms state-of-the-art approaches, achieving up to 2.12$\times$ speedup with comparable reasoning quality.

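Summary

Tree-search-based reasoning methods have significantly improved the reasoning ability of large language models (LLMs) by exploring multiple intermediate reasoning steps, i.e., thoughts. However, because these methods must generate a large number of reasoning thoughts, they incur high inference latency, which severely limits the practical applicability of LLMs. To address this challenge, we propose Speculative Search (SpecSearch), a novel framework that significantly accelerates LLM reasoning by optimizing thought generation. Specifically, SpecSearch has a small model collaborate strategically with a large model at both the thought and token levels to generate high-quality reasoning thoughts efficiently. At the core of the framework is a novel quality-preserving rejection mechanism that effectively filters out thoughts whose quality falls below that of the large model's outputs. We also show that SpecSearch maintains reasoning quality comparable to that of the large model. Experiments on Qwen and Llama models show that SpecSearch significantly outperforms state-of-the-art approaches, achieving up to a 2.12$\times$ speedup with comparable reasoning quality.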

- Abstract
- 摘要
Evaluating Large Language Models for Code Review
- Abstract
- 摘要
From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data
- Abstract
- 摘要
Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning
- Abstract
- 摘要
KnowTrace: Bootstrapping Iterative Retrieval-Augmented Generation with Structured Knowledge Tracing
- Abstract
- 摘要
DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning
- Abstract
- 摘要
Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs
- Abstract
- 摘要
Lifelong Safety Alignment for Language Models
- Abstract
- 摘要
Reasoning LLMs are Wandering Solution Explorers
- Abstract
- 摘要
Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution?
- Abstract
- 摘要
The Coverage Principle: A Framework for Understanding Compositional Generalization
- Abstract
- 摘要
Does quantization affect models' performance on long-context tasks?
- Abstract
- 摘要
MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding
- Abstract
- 摘要
A Generative Approach to Credit Prediction with Learnable Prompts for Multi-scale Temporal Representation Learning
- Abstract
- 摘要
Unified Preference Optimization: Language Model Alignment Beyond the Preference Frontier
- Abstract
- 摘要
Algorithmic Language Models with Neurally Compiled Libraries
- Abstract
- 摘要
Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System
- Abstract
- 摘要
ChemToolAgent: The Impact of Tools on Language Agents for Chemistry Problem Solving
- Abstract
- 摘要
P² Law: Scaling Law for Post-Training After Model Pruning
- Abstract
- 摘要
SaVe-TAG: Semantic-aware Vicinal Risk Minimization for Long-Tailed Text-Attributed Graphs
- Abstract
- 摘要
BPP-Search: Enhancing Tree of Thought Reasoning for Mathematical Modeling Problem Solving
- Abstract
- 摘要
NanoFlow: Towards Optimal Large Language Model Serving Throughput
- Abstract
- 摘要
Better Think with Tables: Tabular Structures Enhance LLM Comprehension for Data-Analytics Requests
- Abstract
- 摘要
Demonstration Selection for In-Context Learning via Reinforcement Learning
- Abstract
- 摘要
PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving
- Abstract
- 摘要
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
- Abstract
- 摘要
Value Compass Leaderboard: A Platform for Fundamental and Validated Evaluation of LLMs Values
- Abstract
- 摘要
AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents
- Abstract
- 摘要
Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach
- Abstract
- 摘要
PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning
- Abstract
When More is Less: Understanding Chain-of-Thought Length in LLMs
- Abstract
- 摘要
KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs
- Abstract
- 摘要
Automated Knowledge Component Generation and Knowledge Tracing for Coding Problems
- Abstract
- 摘要
SMART: Self-Aware Agent for Tool Overuse Mitigation
- Abstract
- 摘要
HPS: Hard Preference Sampling for Human Preference Alignment
- Abstract
- 摘要
TheoremExplainAgent: Towards Video-based Multimodal Explanations for LLM Theorem Understanding
- Abstract
- 摘要
Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices
- Abstract
- 摘要
ARise: Towards Knowledge-Augmented Reasoning via Risk-Adaptive Search
- Abstract
- 摘要
IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery
- Abstract
- 摘要
FamilyTool: A Multi-hop Personalized Tool Use Benchmark
- Abstract
- 摘要
Toward Reliable Ad-hoc Scientific Information Extraction: A Case Study on Two Materials Datasets
- Abstract
- 摘要
JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs
- Abstract
- 摘要
Model Extrapolation Expedites Alignment
- Abstract
- 摘要
Parrot: Multilingual Visual Instruction Tuning
- Abstract
- 摘要
Query Performance Prediction using Relevance Judgments Generated by Large Language Models
- Abstract
- 摘要
Chain-of-Scrutiny: Detecting Backdoor Attacks for Large Language Models
- Abstract
- 摘要
Language Models Benefit from Preparation with Elicited Knowledge
- Abstract
- 摘要
USDC: A Dataset of User Stance and Dogmatism in Long Conversations
- Abstract
- 摘要
RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models
- Abstract
- 摘要
PII-Scope: A Comprehensive Study on Training Data PII Extraction Attacks in LLMs
- Abstract
- 摘要
Identifying Knowledge Editing Types in Large Language Models
- Abstract
- 摘要
Policy Filtration for RLHF to Mitigate Noise in Reward Models
- Abstract
- 摘要
Stuffed Mamba: Oversized States Lead to the Inability to Forget
- Abstract
- 摘要
Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up
- Abstract
- 摘要
SynapticRAG: Enhancing Temporal Memory Retrieval in Large Language Models through Synaptic Mechanisms
- Abstract
- 摘要
Conformity in Large Language Models
- Abstract
- 摘要
LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts
- Abstract
- 摘要
AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels
- Abstract
- 摘要
Regress, Don't Guess -- A Regression-like Loss on Number Tokens for Language Models
- Abstract
- 摘要
Interacting Large Language Model Agents. Interpretable Models and Social Learning
- Abstract
- 摘要
FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers
- Abstract
- 摘要
Rethinking Chain-of-Thought from the Perspective of Self-Training
- Abstract
- 摘要
HARP: Hesitation-Aware Reframing in Transformer Inference Pass
- Abstract
- 摘要
AIGCodeSet: A New Annotated Dataset for AI Generated Code Detection
- Abstract
- 摘要
Crabs: Consuming Resource via Auto-generation for LLM-DoS Attack under Black-box Settings
- Abstract
- 摘要
EscapeBench: Towards Advancing Creative Intelligence of Language Model Agents
- Abstract
- 摘要
Each Graph is a New Language: Graph Learning with LLMs
- Abstract
GMoE: Empowering LLMs Fine-Tuning via MoE Graph Collaboration
- Abstract
- 摘要
iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use
- Abstract
- 摘要
How to Synthesize Text Data without Model Collapse?
- Abstract
- 摘要
A partition cover approach to tokenization
- Abstract
- 摘要
The Invisible Hand: Unveiling Provider Bias in Large Language Models for Code Generation
- Abstract
- 摘要
NExtLong: Toward Effective Long-Context Training without Long Documents
- Abstract
- 摘要
Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation
- Abstract
- 摘要
Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models
- Abstract
- 摘要
A statistically consistent measure of semantic uncertainty using Language Models
- Abstract
- 摘要
A Checks-and-Balances Framework for Context-Aware Ethical AI Alignment
- Abstract
- 摘要
Sigmoid Self-Attention has Lower Sample Complexity than Softmax Self-Attention: A Mixture-of-Experts Perspective
- Abstract
- 摘要
Improving Rule-based Reasoning in LLMs via Neurosymbolic Representations
- Abstract
- 摘要
UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models
- Abstract
- 摘要
JingFang: An Expert-Level Large Language Model for Traditional Chinese Medicine Clinical Consultation and Syndrome Differentiation-Based Treatment
- Abstract
Polynomial, trigonometric, and tropical activations
- Abstract
- 摘要
Preference Leakage: A Contamination Problem in LLM-as-a-judge
- Abstract
- 摘要
ACECODER: Acing Coder RL via Automated Test-Case Synthesis
- Abstract
- 摘要
Mol-LLM: Multimodal Generalist Molecular LLM with Improved Graph Utilization
- Abstract
- 摘要
CMoE: Converting Mixture-of-Experts from Dense to Accelerate LLM Inference
- Abstract
- 摘要
SelfElicit: Your Language Model Secretly Knows Where is the Relevant Evidence
- Abstract
- 摘要
Aligning Large Language Models to Follow Instructions and Hallucinate Less via Effective Data Filtering
- Abstract
- 摘要
Provably Overwhelming Transformer Models with Designed Inputs
- Abstract
- 摘要
Diffusion Instruction Tuning
- Abstract
- 摘要
DECT: Harnessing LLM-assisted Fine-Grained Linguistic Knowledge and Label-Switched and Label-Preserved Data Generation for Diagnosis of Alzheimer's Disease
- Abstract
- 摘要
MELON: Provable Indirect Prompt Injection Defense via Masked Re-execution and Tool Comparison
- Abstract
- 摘要
QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language
- Abstract
- 摘要
A Survey of LLM-based Agents in Medicine: How far are we from Baymax?
- Abstract
- 摘要
Balancing Truthfulness and Informativeness with Uncertainty-Aware Instruction Fine-Tuning
- Abstract
- 摘要
ReviewEval: An Evaluation Framework for AI-Generated Reviews
- Abstract
- 摘要
DiSCo: Device-Server Collaborative LLM-Based Text Streaming Services
- Abstract
- 摘要
TokenSkip: Controllable Chain-of-Thought Compression in LLMs
- Abstract
- 摘要
Conditioning LLMs to Generate Code-Switched Text
- Abstract
- 摘要
A Cognitive Writing Perspective for Constrained Long-Form Text Generation
- Abstract
- 摘要
Training Turn-by-Turn Verifiers for Dialogue Tutoring Agents: The Curious Case of LLMs as Your Coding Tutors
- Abstract
- 摘要
A Tale of Two Structures: Do LLMs Capture the Fractal Complexity of Language?
- Abstract
- 摘要
CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter
- Abstract
- 摘要
Does Reasoning Introduce Bias? A Study of Social Bias Evaluation and Mitigation in LLM Reasoning
- Abstract
- 摘要
Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch
- Abstract
- 摘要
Can LLMs Help Uncover Insights about LLMs? A Large-Scale, Evolving Literature Analysis of Frontier LLMs
- Abstract
- 摘要
Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models
- Abstract
- 摘要
Detecting LLM-Generated Korean Text through Linguistic Feature Analysis
- Abstract
- 摘要
Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions
- Abstract
- 摘要
LINGOLY-TOO: Disentangling Memorisation from Knowledge with Linguistic Templatisation and Orthographic Obfuscation
- Abstract
- 摘要
PersonaX: A Recommendation Agent Oriented User Modeling Framework for Long Behavior Sequence
- Abstract
- 摘要
InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models
- Abstract
- 摘要
General Table Question Answering via Answer-Formula Joint Generation
- Abstract
- 摘要
Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
- Abstract
- 摘要
One-Shot is Enough: Consolidating Multi-Turn Attacks into Efficient Single-Turn Prompts for LLMs
- Abstract
- 摘要
MOSAIC: Modeling Social AI for Content Dissemination and Regulation in Multi-Agent Simulations
- Abstract
- 摘要
Societal Impacts Research Requires Benchmarks for Creative Composition Tasks
- Abstract
- 摘要
DocAgent: A Multi-Agent System for Automated Code Documentation Generation
- Abstract
- 摘要
Universal Item Tokenization for Transferable Generative Recommendation
- Abstract
- 摘要
Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models
- Abstract
- 摘要
Mirror: Multimodal Cognitive Reframing Therapy for Rolling with Resistance
- Abstract
- 摘要
Empirical Evaluation of Knowledge Distillation from Transformers to Subquadratic Language Models
- Abstract
- 摘要
TextArena
- Abstract
- 摘要
Robo-Troj: Attacking LLM-based Task Planners
- Abstract
- 摘要
BackSlash: Rate Constrained Optimized Training of Large Language Models
- Abstract
- 摘要
Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction
- Abstract
- 摘要
Intra-Layer Recurrence in Transformers for Language Modeling
- Abstract
- 摘要
Accelerating Large Language Model Reasoning via Speculative Search
- Abstract
- 摘要
